I'm creating a DLL for a application. The application calls the DLL and receives a string from 8 to 50 in length.
The problem I'm having is that only the first letter of any message the application receives is shown.
Below is the GetMethodVersion function.
#include "stdafx.h"
STDAPI_(void) GetMethodVersion(LPTSTR out_strMethodVersion, int in_intSize)
{
if ((int)staticMethodVersion.length() > in_intSize)
return;
_tcscpy_s(out_strMethodVersion, 12, _T("Test"));
//staticMethodVersion should be insted of _T("Test")
}
The project settings are set to Unicode.
I belive after some research that there is a problem with Unicode format and how it functions. Thanks for any help you can give.
You wrote in your question that the project settings are Unicode: is this true for both the DLL and the calling EXE? Make sure that they both match.
In Unicode builds, the ugly TCHAR macros become:
LPTSTR --> wchar_t*
_tcscpy_s --> wcscpy_s
_T("Test") --> L"Test"
So you have:
STDAPI_(void) GetMethodVersion(wchar_t* out_strMethodVersion,
int in_intSize)
{
...
wcscpy_s(out_strMethodVersion, 12, L"Test");
}
Are you sure the "magic number" 12 is correct? Is the destination string buffer pointed to by out_strMethodVersion of size at least 12 wchar_ts (including the terminating NUL)?
Then, have a look at the call site (which you haven't showed).
How do you print the returned string? Maybe you are using an ANSI char function, so the returned string is misinterpreted as a char* ANSI string, and so the first 0x00 byte of the Unicode UTF-16 string is misinterpreted as a NUL-terminator at the call site, and the string gets truncated at the first character when printed?
Text: T e s t NUL
UTF-16 bytes: 54 00 65 00 73 00 74 00 00 00
(hex) **<--+
|
First 00 byte misinterpreted as
NUL terminator in char* ANSI string,
so only 'T' (the first character) gets printed.
EDIT
The fact that you clarified in the comments that:
I switched the DLL to ANSI, the EXE apparently was that as well, though the exe was documented as Unicode.
makes me think that the EXE assumes the UTF-8 Unicode encoding.
Just as in ANSI strings, a 0x00 byte in UTF-8 is a string NUL terminator, so the previous analysis of UTF-16 0x00 byte (in a wchar_t) misinterpreted as string NUL terminator applies.
Note that pure ASCII is a proper subset of UTF-8: so your code may work if you just use pure ASCII characters (like in "Test") and pass them to the EXE.
However, if the EXE is documented to be using Unicode UTF-8, you may want to Do The Right Thing and return a UTF-8 string from the DLL.
The string is returned via char* (as for ANSI strings), but it's important that you make sure that UTF-8 is the encoding used by the DLL to return that string, to avoid subtle bugs in the future.
While the general terminology used in Windows APIs and Visual Studio is "Unicode", it actually means the UTF-16 Unicode encoding in those contexts.
However, UTF-16 is not the only Unicode encoding available. For example, to exchange text on the Internet, the UTF-8 encoding is widely used. In your case, it sounds like your EXE is expecting a Unicode UTF-8 string.
It is to late to #define UNICODE after #include "stdafx.h". It should be defined before the first #include in stdafx.h itself. But the proper way is to set it in project properties (menu Project > Properties > Configuration Properties > General > Character Set > "Use Unicode Character Set").
Related
I'm trying to print all chinese character with c++.
my project uses wchar_t for handling unicode string.
wikipedia said: chinese chars are located in:
4E00–62FF
6300–77FF
7800–8CFF
8D00–9FFF
3400–4DBF
20000–215FF
21600–230FF
23100–245FF
24600–260FF
26100–275FF
27600–290FF
29100–2A6DF
2A700–2B73F
2B740–2B81F
2B820–2CEAF
2CEB0–2EBEF
30000–3134F
but my code said wchar_t is 2byte.
this means wchar_t support 0x0000 - 0xFFFF.
how can I print unicode char over 0xFFFF? should I use another thing (such as wstring?) rather than wchar_t?
thanks in advance.
This question already has answers here:
Is it possible to cout an EM DASH on Linux and Windows? [duplicate]
(2 answers)
Closed 5 years ago.
A simple problem: I'm writing a chatroom program in C++ (but it's primarily C-style) for a class, and I'm trying to print, “#help — display a list of commands...” to the output window. While I could use two hyphens (--) to achieve roughly the same effect, I'd rather use an em-dash (—). printf(), however, doesn't seem to support printing em-dashes. Instead, the console just prints out the character, ù, in its place, despite the fact that entering em-dashes directly into the prompt works fine.
How do I get this simple Unicode character to show up?
Looking at Windows alt key codes, I find it interesting how alt+0151 is "—" and alt+151 is "ù". Is this related to my problem, or a simple coincidence?
the windows is unicode (UTF-16) system. console unicode as well. if you want print unicode text - you need (and this is most effective) use WriteConsoleW
BOOL PrintString(PCWSTR psz)
{
DWORD n;
return WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), psz, (ULONG)wcslen(psz), &n, 0);
}
PrintString(L"—");
in this case in your binary file will be wide character — (2 bytes 0x2014) and console print it as is.
if ansi (multi-byte) function is called for output console - like WriteConsoleA or WriteFile - console first translate multi-byte string to unicode via MultiByteToWideChar and in place CodePage will be used value returned by GetConsoleOutputCP. and here (translation) can be problem if you use characters > 0x80
first of all compiler can give you warning: The file contains a character that cannot be represented in the current code page (number). Save the file in Unicode format to prevent data loss. (C4819). but even after you save source file in Unicode format, can be next:
wprintf(L"ù"); // no warning
printf("ù"); //warning C4566
because L"ù" saved as wide char string (as is) in binary file - here all ok and no any problems and warning. but "ù" is saved as char string (single byte string). compiler need convert wide string "ù" from source file to multi-byte string in binary (.obj file, from which linker create pe than). and compiler use for this WideCharToMultiByte with CP_ACP (The current system default Windows ANSI code page.)
so what happens if you say call printf("ù"); ?
unicode string "ù" will be converted to multi-byte
WideCharToMultiByte(CP_ACP, ) and this will be at compile time. resulting multi-byte string will be saved in binary file
the console it run-time convert your multi-byte string to
wide char by MultiByteToWideChar(GetConsoleOutputCP(), ..) and
print this string
so you got 2 conversions: unicode -> CP_ACP -> multi-byte -> GetConsoleOutputCP() -> unicode
by default GetConsoleOutputCP() == CP_OEMCP != CP_ACP even if you run program on computer where you compile it. (on another computer with another CP_OEMCP especially)
problem in incompatible conversions - different code pages used. but even if you change console code page to your CP_ACP - convertion anyway can wrong translate some characters.
and about CRT api wprintf - here situation is next:
the wprintf first convert given string from unicode to multi-byte by using it internal current locale (and note that crt locale independent and different from console locale). and then call WriteFile with multi-byte string. console convert back this multi-bytes string to unicode
unicode -> current_crt_locale -> multi-byte -> GetConsoleOutputCP() -> unicode
so for use wprintf we need first set current crt locale to GetConsoleOutputCP()
char sz[16];
sprintf(sz, ".%u", GetConsoleOutputCP());
setlocale(LC_ALL, sz);
wprintf(L"—");
but anyway here i view (on my comp) - on screen instead —. so will be -— if call PrintString(L"—"); (which used WriteConsoleW) just after this.
so only reliable way print any unicode characters (supported by windows) - use WriteConsoleW api.
After going through the comments, I've found eryksun's solution to be the simplest (...and the most comprehensible):
#include <stdio.h>
#include <io.h>
#include <fcntl.h>
int main()
{
//other stuff
_setmode(_fileno(stdout), _O_U16TEXT);
wprintf(L"#help — display a list of commands...");
Portability isn't a concern of mine, and this solves my initial problem—no more ù—my beloved em-dash is on display.
I acknowledge this question is essentially a duplicate of the one linked by sata300.de. Albeit, with printf in the place of cout, and unnecessary ramblings in the place of relevant information.
I'm trying to convert text containing Hiragana from a wstring to a QString, so that it can be used on a label's text property. However my code is not working and I'm not sure why that is.
The following conversion method obviously tells me that I made something wrong:
std::wstring myWString = L"Some Hiragana: あ い う え お";
ui->label->setText(QString::fromStdWString(myWString));
Output: Some Hiragana: ゠ㄠㆠ㈠ãŠ
I can print Hiragana on a label if I put them in a string directly:
ui->label->setText("Some Hiragana: あ い う え お");
Output: Some Hiragana: あ い う え お
That means I can avoid this problem by simply using std::string instead of std::wstring, but I'd like to know why this is happening.
VS is interpreting the file as Windows-1252 instead of UTF-8.
As an example, 'あ' in UTF-8 is E3 81 82, but the compiler is reading each byte as a single Windows-1252 char before converting it to the respective UTF-16 codepoints E3 201A, which works out as 'ã‚' (81 is either ignored by VS as it is reserved in Windows-1252, or not printed by qt if VS happens to convert it to the respective C1 control character).
The direct version works because the compiler doesn't perform any conversions and leaves the string as E3 81 82.
To fix your issue you will need to inform VS that the file is UTF-8, according to other posts one way is to ensure the file has a UTF-8 BOM.
The only portable way of fixing this is to use escape sequences instead:
L"Some Hiragana: \u3042 \u3044 \u3046 \u3048 \u304A"
Per MSDN:
"For the Microsoft C/C++ compiler, the source and execution character sets are both ASCII."
C++03
2.1 Phases of translation
"..Any source file character not in the basic source character set
(2.2) is replaced by the universal-character-name that designates that
character. (An implementation may use any internal encoding, so long
as an actual extended character encountered in the source file, and
the same extended character expressed in the source file as a
universal-character-name (i.e. using the \uXXXX notation), are handled
equivalently.)"
2.13.2 Character literals
"A universal-character-name is translated to the encoding, in the
execution character set, of the character named. If there is no such
encoding, the universal-character-name is translated to an
implementation-defined encoding."
To test which execution character set is used by MSVC++, I wrote the following code:
wchar_t *str = L"中";
unsigned char *p = reinterpret_cast<unsigned char*>(str);
for (int i = 0; i < sizeof(L"中"); ++i)
{
printf ("%x ", *(p + i));
}
The output shows that 2d 4e 0 0, and 0x4e2d is the UTF-16 encoding of this Chinese character. So I conclude: UTF-16 is used as execution character set by MSVC (My version: 2012 4.5.50709)
After, I tried to print this character out to a Windows console. Since the default locale used by console is "C", I set the locale to code page 936 representing simplified Chinese characters.
// use the execution environment locale setting, which is 936
wchar_t *str = L"中";
char* locale = setlocale(LC_ALL, "");
wprintf (L"%ls\n", str);
Which outputs:
中
What I'm curious about is, how can a character encoded in UTF-16 be decoded by a Windows console whose locale(decoder) is set to non-UTF-16(MS code page 936)? How can that happen?
how can a character encoded in UTF-16 be decoded by a Windows console whose locale(decoder) is set to non-UTF-16
There are two ways you can write text to the console. The byte way, using the Win32 API WriteConsoleA, gives you characters from bytes interpreted using the console's code page ("ANSI"). The Unicode way, WriteConsoleW, receives a UTF-16LE string and writes the characters to the console directly without having to worry about what code page it is using.
The stdio function printf uses WriteConsoleA when the output is an interactive console. The wprintf function, from VS 2005 on at least, calls WriteConsoleW.
I think I get it.
In Microsoft C++ 2008(probably 2005+), CRT functions as wprintf, wcout are implemented such that they convert a wide string literal as L"中" encoded in UTF-16, under the hood, to match the current locale/code page setting. So what happens here is that L"中" is converted to bytes D6 D0 in code page 936 for simplified Chinese.
I was wrong that setlocale set the console code page. It just set the current program code page which is used by CRT functions during the "conversion". For changing console code page, command chcp or Win API SetConsoleOputputCP() achieves.
Since my console's default page is 936, that character can be correctly shown w/o problem.
Why does this code:
char a[10];
wchar_t w[10] = L"ä"; // German a Umlaut
int e = wcstombs(a, w, 10);
return e == -1?
I am using Oracle Solaris Studio 10 on Solaris 11. The locale is Latin-1, which contains the German Umlauts. All docs I have found indicate (to me) that the conversion should succeed.
If I do this:
char a[10] = "ä"; // German a Umlaut
wchar_t w[10];
int e = mbstowcs(w, a, 10);
e = wcstombs(a, w, 10);
there is no error, but the result is wrong. (Some variant of upper A.)
I also tried wstostr with similar result.
1) verify that the correct value is getting into the wchar_t. The compiler producing the wide character string literal has to convert L"ä" from the source code encoding to the wide execution charset.
2) verify that the program's locale is correct. You can do this with printf("%s\n", setlocale(LC_ALL, NULL));
I suspect that the problem is 1) because for me even if the program's locale isn't set correctly I still get the expected output. To avoid problems with the source code encoding you can escape non-ascii characters like L"\x00E4".
1 #include <iostream>
2 #include <clocale>
3
4 int main () {
5 std::printf("%s\n", std::setlocale(LC_ALL, NULL)); // prints "C"
6
7 char a[10];
8 wchar_t w[10] = L"\x00E4"; // German a Umlaut
9 std::printf("0x%04x\n", (unsigned)w[0]); // prints "0x00e4"
10
11 std::setlocale(LC_ALL, "");
12 printf("%s\n", std::setlocale(LC_ALL, NULL)); // print something that indicates the encoding is ISO 8859-1
13 int e = std::wcstombs(a, w, 10);
14 std::printf("%i 0x%02x\n", e, (unsigned char)a[0]); // print "1 0xe4"
15 }
16
Character Sets in C and C++ Programs
In your source code you can use any character from the 'source character set', which is a superset of the 'basic source character set'. The compiler will convert characters in string and character literals from the source character set into the execution character set (or wide execution character set for wide string and character literals).
The issue is that the source character set is implementation dependent. Typically the compiler simply has to know what encoding you use for the source code and then it will accept any characters from that encoding. GCC has command line arguments for setting the source encoding, Visual Studio will assume that the source is in the user's codepage unless it detects one of the so-called Unicode signatures for UTF-8 or UTF-16, and Clang currently always uses UTF-8.
Once the compiler is using the right source character set for your code it will then produce string and character literals in the 'execution character set'. The execution character set is another superset of the basic source character set, and is also implementation dependent. GCC takes a command line argument to set the execution character set, VS uses the user's locale, and Clang uses UTF-8.
Because the source character set is implementation dependent, the portable way to write characters outside the basic set is to either use hex encoding to directly specify the numeric values to be used in execution, or (if you're not using C89/90) to use universal character names (UCNs), which are converted to the execution character set (or wide execution character set when used in wide string and character literals). UCNs look like \uNNNN or \UNNNNNNNN and specify the character from the Unicode character set with the code point value NNNN or NNNNNNNN. (Note that C99 and C++11 prohibit you from using surrogate code points, if you want a character from outside the BMP just directly write the character's value using \U.)
The source and execution character sets are determined at compile time and do not change based on the locale of the system running the program. That is, the program locale uses another encoding not necessarily matching the execution character set. The wide execution character set should correspond to the wide character encoding used by supported locales, however.
Solaris Studio's behavior
Oracle's compiler for Solaris has very simple behavior. For narrow string and character literals no particular source encoding is specified, bytes from the source code are simply used directly as the execution literal. This effectively means that the execution character set is the same as the encoding of the source files. For wide character literals the source bytes are converted using the system locale. This means that you have to save the source file using the locale encoding in order to get correct wide literals.
I suspect that your source code is being saved in an encoding other than the one specified by the locale, so your compiler was failing to produce the correct wide string literal from L"ä". Your editor might be using UTF-8. You can check using the following program.
1 #include <iostream>
2 #include <clocale>
3
4 int main () {
5 wchar_t w[10] = L"ä"; // German a Umlaut
6 std::printf("0x%04x 0x%04x\n", (unsigned)w[0], (unsigned)w[1]);
7 }
8
Since wcstombs can correctly convert the wide character 0x00E4 to the latin-1 encoding of 'ä' you want the above to display 0x00E4 0x0000. If the source code encoding is UTF-8 then you should see 0x00C3 0x00A4.
You may have to set the locale to understand German. Specifically you want the ctype facet.
Try this:
setlocale( LC_ALL, ".1252" );
or specifically this:
setlocale( LC_CTYPE, ".1252" );
You may have to search for a better codepage than ".1252". Good luck.
The codepage examples above are Windows. On Unixy systems try "de_DE" for the codepage.