Filenames truncate to only show first character - C++

I'm following this guide from MSDN on how to list the files in a directory (I'm using the current directory). In my case I need to put the information in the message part of my packet (a char array of size 1016) to send it to the client. When I print packet.message on both the client and the server, only the first character of each filename is shown. What's wrong? Here's a snippet of the relevant section of code:
WIN32_FIND_DATA f;
HANDLE h = FindFirstFile(TEXT("./*.*"), &f);
string file;
int size_needed;
do
{
sprintf(packet.message,"%s", &f.cFileName);
//Send packet
} while(FindNextFile(h, &f));

This is commonly caused by a wide character string being mistakenly treated as an ASCII string. The build is targeting UNICODE and cFileName contains a wide character string, but sprintf() is assuming it is an ASCII string.
FindFirstFile() will be mapped to either FindFirstFileA() or FindFirstFileW() depending on whether or not the build is targeting UNICODE.
A solution would be to use FindFirstFileA() and ASCII strings explicitly.
Note that the & is not required in the sprintf():
sprintf(packet.message, "%s", f.cFileName);
As the application is consuming strings that are outside of its control (i.e. file names), I would recommend using the safer _snprintf() to avoid buffer overruns:
/* From your comment on the question 'packet.message' is a 'char[1016]'
so 'sizeof()' will function correctly. */
if (_snprintf(packet.message, sizeof(packet.message), "%s", f.cFileName) > 0)
{
}
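One caveat worth noting (a sketch, not from the original answer): Microsoft's _snprintf() does not null-terminate the buffer when the output is truncated, and it returns a negative value in that case, so it is safest to reserve the last byte and terminate the buffer explicitly:
/* _snprintf() does not write a terminator when the result is truncated,
   so reserve the last byte and terminate manually. */
int written = _snprintf(packet.message, sizeof(packet.message) - 1, "%s", f.cFileName);
packet.message[sizeof(packet.message) - 1] = '\0';
if (written > 0)
{
    /* Send packet */
}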

You're almost certainly using the Unicode version of FindFirstFile. Either invoke the narrow version explicitly or change the format specifier in your printf call. Personally, I would do the former:
WIN32_FIND_DATAA f;
HANDLE h = FindFirstFileA("./*.*", &f);
string file;
int size_needed;
do
{
sprintf(packet.message,"%s", f.cFileName);
//Send packet
} while(FindNextFileA(h, &f));
FindClose(h);
Alternatively, you can compile for MBCS or plain single-byte characters instead of UNICODE.

As others have mentioned, you are calling the Unicode version of FindFirstFile() and are passing Unicode data to the Ansi sprintf() function. The %s specifier expects Ansi input. You have a few choices to address the issue in your code:
continue using sprintf(), but change the %s specifier to %ls so it will accept Unicode input and convert it to Ansi when writing to your message buffer:
sprintf(packet.message, "%ls", f.cFileName);
This is not ideal, though, because it will use the Ansi encoding of the local machine, which may be different than the Ansi encoding used by the receiving machine.
change your message buffer to use TCHAR instead of char, and then switch to either wsprintf() or _stprintf() instead of sprintf(). Like FindFirstFile(), they will match whatever character format TCHAR and TEXT() use:
TCHAR message[1016];
wsprintf(packet.message, TEXT("%s"), f.cFileName);
Or:
#include <tchar.h>
_TCHAR message[1016];
_stprintf(packet.message, _T("%s"), f.cFileName);
if you must use a char buffer, then you should accept Unicode data from the API and convert it to UTF-8 for transmission; the receiver can then convert it back to Unicode and use it as needed (a sketch of the receiving side follows the code below).
WIN32_FIND_DATAW f;
HANDLE h = FindFirstFileW(L"./*.*", &f);
if (h != INVALID_HANDLE_VALUE)
{
    do
    {
        // Pass -1 as the source length so the terminating null is converted
        // too, leaving packet.message null-terminated.
        WideCharToMultiByte(CP_UTF8, 0, f.cFileName, -1, packet.message, sizeof(packet.message), NULL, NULL);
        //Send packet
    } while (FindNextFileW(h, &f));
    FindClose(h);
}
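On the receiving side, a minimal sketch of the reverse conversion (an illustration only; the buffer name and size are assumptions, not part of the original answer):
// Hypothetical receiver: packet.message holds a null-terminated UTF-8 string.
wchar_t filename[MAX_PATH];
int converted = MultiByteToWideChar(CP_UTF8, 0, packet.message, -1, filename, MAX_PATH);
if (converted > 0)
{
    // 'filename' now holds the original Unicode file name.
}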

Related

How to implement console and file input/output functionality to work with UTF-8 encoding in MinGW? [duplicate]

This is the way I try to do it:
#include <stdio.h>
#include <windows.h>
using namespace std;
int main() {
SetConsoleOutputCP(CP_UTF8);
//german chars won't appear
char const* text = "aäbcdefghijklmnoöpqrsßtuüvwxyz";
int len = MultiByteToWideChar(CP_UTF8, 0, text, -1, 0, 0);
wchar_t *unicode_text = new wchar_t[len];
MultiByteToWideChar(CP_UTF8, 0, text, -1, unicode_text, len);
wprintf(L"%s", unicode_text);
}
And the effect is that only US-ASCII chars are displayed. No errors are shown. The source file is encoded in UTF-8.
So, what am I doing wrong here?
To WouterH:
int main() {
SetConsoleOutputCP(CP_UTF8);
const wchar_t *unicode_text = L"aäbcdefghijklmnoöpqrsßtuüvwxyz";
wprintf(L"%s", unicode_text);
}
This also doesn't work; the effect is just the same. My font is, of course, Lucida Console.
Third take:
#include <stdio.h>
#define _WIN32_WINNT 0x05010300
#include <windows.h>
#define _O_U16TEXT 0x20000
#include <fcntl.h>
using namespace std;
int main() {
_setmode(_fileno(stdout), _O_U16TEXT);
const wchar_t *u_text = L"aäbcdefghijklmnoöpqrsßtuüvwxyz";
wprintf(L"%s", u_text);
}
OK, something begins to work, but the output is: ańbcdefghijklmno÷pqrs▀tuŘvwxyz.
By default the wide print functions on Windows do not handle characters outside the ASCII range.
There are a few ways to get Unicode data to the Windows console.
use the console API directly, WriteConsoleW. You'll have to ensure you're actually writing to a console and use other means when the output is to something else (a minimal sketch follows this list).
set the mode of the standard output file descriptors to one of the 'Unicode' modes, _O_U16TEXT or _O_U8TEXT. This causes the wide character output functions to correctly output Unicode data to the Windows console. If they're used on file descriptors that don't represent a console then they cause the output stream of bytes to be UTF-16 and UTF-8 respectively. N.B. after setting these modes the non-wide character functions on the corresponding stream are unusable and result in a crash. You must use only the wide character functions.
UTF-8 text can be printed directly to the console by setting the console output codepage to CP_UTF8, if you use the right functions. Most of the higher level functions such as basic_ostream<char>::operator<<(char*) don't work this way, but you can either use lower level functions or implement your own ostream that works around the problem the standard functions have.
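A minimal sketch of the first option (an illustration only, with error handling omitted; GetConsoleMode is used to check that stdout really is a console):
#include <windows.h>

int main() {
    const wchar_t text[] = L"aäbcdefghijklmnoöpqrsßtuüvwxyz\n";
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode, written;
    // GetConsoleMode succeeds only for a real console handle.
    if (GetConsoleMode(out, &mode)) {
        WriteConsoleW(out, text, (DWORD)(sizeof(text) / sizeof(text[0]) - 1), &written, NULL);
    }
    return 0;
}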
The problem with the third method is this:
putc('\302', stdout); putc('\260', stdout); // doesn't work with CP_UTF8
puts("\302\260"); // correctly writes UTF-8 data to Windows console with CP_UTF8
Unlike most operating systems, the console on Windows is not simply another file that accepts a stream of bytes. It's a special device created and owned by the program and accessed via its own unique WIN32 API. The issue is that when the console is written to, the API sees exactly the extent of the data passed in that use of its API, and the conversion from narrow characters to wide characters occurs without considering that the data may be incomplete. When a multibyte character is passed using more than one call to the console API, each separately passed piece is seen as an illegal encoding, and is treated as such.
It ought to be easy enough to work around this, but the CRT team at Microsoft views it as not their problem whereas whatever team works on the console probably doesn't care.
You might solve it by implementing your own streambuf subclass which handles the conversion to wchar_t correctly, i.e. accounting for the fact that the bytes of a multibyte character may arrive separately, and maintaining conversion state between writes (e.g. std::mbstate_t). A rough sketch of that idea follows.
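A minimal, illustrative sketch of such a streambuf (assumptions: output goes to a real console, the incoming bytes are UTF-8, and the class name is made up; this is not a drop-in, production-ready implementation):
#include <windows.h>
#include <iostream>
#include <streambuf>
#include <string>

// Buffers bytes until a complete UTF-8 sequence is available, then converts
// it and writes it with WriteConsoleW, so split multibyte characters survive.
class ConsoleUtf8Buf : public std::streambuf {
public:
    explicit ConsoleUtf8Buf(HANDLE console) : console_(console) {}

protected:
    int_type overflow(int_type ch) override {
        if (ch == traits_type::eof())
            return traits_type::not_eof(ch);
        pending_.push_back(static_cast<char>(ch));
        flushComplete();
        return ch;
    }

private:
    // Length of the UTF-8 sequence introduced by lead byte 'b'.
    static int seqLength(unsigned char b) {
        if (b < 0x80) return 1;
        if ((b & 0xE0) == 0xC0) return 2;
        if ((b & 0xF0) == 0xE0) return 3;
        if ((b & 0xF8) == 0xF0) return 4;
        return 1; // invalid lead byte: pass it through as a single byte
    }

    void flushComplete() {
        size_t complete = 0;
        while (complete < pending_.size()) {
            int len = seqLength(static_cast<unsigned char>(pending_[complete]));
            if (complete + len > pending_.size())
                break; // incomplete sequence: keep it for the next write
            complete += len;
        }
        if (complete == 0)
            return;
        wchar_t wide[512];
        int n = MultiByteToWideChar(CP_UTF8, 0, pending_.data(),
                                    static_cast<int>(complete), wide, 512);
        DWORD written;
        if (n > 0)
            WriteConsoleW(console_, wide, n, &written, NULL);
        pending_.erase(0, complete);
    }

    HANDLE console_;
    std::string pending_;
};

int main() {
    ConsoleUtf8Buf buf(GetStdHandle(STD_OUTPUT_HANDLE));
    std::ostream out(&buf);
    out << "a\xC3\xA4" "bc\n"; // the UTF-8 bytes reach overflow() one at a time
}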
Another trick, instead of SetConsoleOutputCP, would be using _setmode on stdout:
// Includes needed for _setmode()
#include <io.h>
#include <fcntl.h>
#include <stdio.h>
int main() {
    _setmode(_fileno(stdout), _O_U16TEXT);
    const wchar_t *unicode_text = L"aäbcdefghijklmnoöpqrsßtuüvwxyz";
    wprintf(L"%s", unicode_text);
    return 0;
}
Don't forget to remove the call to SetConsoleOutputCP(CP_UTF8);
//Save As UTF8 without signature
#include<stdio.h>
#include<windows.h>
int main() {
SetConsoleOutputCP(65001);
const char unicode_text[]="aäbcdefghijklmnoöpqrsßtuüvwxyz";
printf("%s\n", unicode_text);
}
Result:
aäbcdefghijklmnoöpqrsßtuüvwxyz
I had similar problems, but none of the existing answers worked for me. Something else I observed is that, if I stick UTF-8 characters in a plain string literal, they would print properly, but if I tried to use a UTF-8 literal (u8"text"), the characters get butchered by the compiler (proved by printing out their numeric values one byte at a time; the raw literal had the correct UTF-8 bytes, as verified on a Linux machine, but the UTF-8 literal was garbage).
After some poking around, I found the solution: the /utf-8 compiler option (MSVC). With that, everything Just Works; my sources are UTF-8, I can use explicit UTF-8 literals, and output works with no other changes needed.
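A minimal sketch of what that looks like (an illustration; it assumes MSVC in its default pre-C++20 language mode, where u8 literals are plain char arrays, and the file name is made up):
// Compile with: cl /EHsc /utf-8 utf8demo.cpp
// /utf-8 sets both the source and execution character sets to UTF-8,
// so the literals below keep their UTF-8 bytes intact.
#include <cstdio>
#include <windows.h>

int main() {
    SetConsoleOutputCP(CP_UTF8);
    // With /utf-8 both of these contain identical UTF-8 byte sequences.
    std::printf("%s\n", "aäbcdefghijklmnoöpqrsßtuüvwxyz");
    std::printf("%s\n", u8"aäbcdefghijklmnoöpqrsßtuüvwxyz");
}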
The console can be set to display UTF-8 chars: as @vladasimovic answers, SetConsoleOutputCP(CP_UTF8) can be used for that. Alternatively, you can prepare your console with the DOS command chcp 65001, or with the call system("chcp 65001 > nul") in the main program. Don't forget to save the source code in UTF-8 as well.
To check the UTF-8 support, run
#include <stdio.h>
#include <windows.h>
BOOL CALLBACK showCPs(LPTSTR cp) {
puts(cp);
return true;
}
int main() {
EnumSystemCodePages(showCPs,CP_SUPPORTED);
}
65001 should appear in the list.
The Windows console uses OEM code pages by default, and most default raster fonts support only national characters. Windows XP and newer also support TrueType fonts, which should display the missing chars (@Devenec suggests Lucida Console in his answer).
Why printf fails
As @bames53 points out in his answer, the Windows console is not a stream device: you need to write all the bytes of a multibyte character at once. Sometimes printf messes this up, putting the bytes into the output buffer one by one. Try using sprintf and then puts on the result, or force fflush only on the accumulated output buffer.
If everything fails
Note the UTF-8 format: one character is encoded as 1-4 bytes. Use this function to shift to the next character in the string:
const char* ucshift(const char* str, int len=1) {
for(int i=0; i<len; ++i) {
if(*str==0) return str;
if(*str<0) {
unsigned char c = *str;
while((c<<=1)&128) ++str;
}
++str;
}
return str;
}
...and this function to transform the bytes into a Unicode code point:
int ucchar(const char* str) {
if(!(*str&128)) return *str;
unsigned char c = *str, bytes = 0;
while((c<<=1)&128) ++bytes;
int result = 0;
for(int i=bytes; i>0; --i) result|= (*(str+i)&127)<<(6*(bytes-i));
int mask = 1;
for(int i=bytes; i<6; ++i) mask<<= 1, mask|= 1;
result|= (*str&mask)<<(6*bytes);
return result;
}
Then you can try some Windows API function like MultiByteToWideChar (which, unlike the standard mbstowcs, does not need a prior setlocale() call),
or you can use your own mapping from the Unicode table to your active working code page. Example:
#include <stdio.h>  /* printf */
#include <stdlib.h> /* system */

int main() {
system("chcp 65001 > nul");
char str[] = "příšerně"; // file saved in UTF-8
for(const char* p=str; *p!=0; p=ucshift(p)) {
int c = ucchar(p);
if(c<128) printf("%c\n",c);
else printf("%d\n",c);
}
}
This should print
p
345
237
353
e
r
n
283
If your code page doesn't support those Czech diacritics, you could map 345=>r, 237=>i, 353=>s, 283=>e. There are at least 5(!) different charsets just for Czech. Displaying readable characters across different Windows locales is a horror.
UTF-8 doesn't work for the Windows console. Period. I have tried all combinations with no success. The problems arise from differing ANSI/OEM character assignments, so some answers say that there is no problem; such answers may come from programmers using 7-bit plain ASCII, or whose ANSI and OEM code pages are identical (Chinese, Japanese).
Either you stick to UTF-16 and the wide-char functions (but you are still restricted to the 256 characters of your OEM code page, except for Chinese/Japanese), or you use OEM code page strings in your source file.
Yes, it is all a mess.
For multilingual programs I use string resources, and I wrote a LoadStringOem() function that auto-translates the UTF-16 resource to an OEM string using WideCharToMultiByte(), without an intermediate buffer. As Windows auto-selects the right language from the resource, it will hopefully load a string in a language that is convertible to the target OEM code page.
As a consequence, you should not use 8-bit typographic characters in the English-US language resource (such as the ellipsis … and the curly quotes “”), because English-US is what Windows selects when no language match is found (i.e. the fallback).
For example, if you have resources in German, Czech, Russian, and English-US and the user's locale is Chinese, they will see English plus garbage instead of your carefully crafted typography.
Now, on Windows 7 and 10, SetConsoleOutputCP(65001 /*aka CP_UTF8*/) works as expected. You should keep your source file in UTF-8 without a BOM, otherwise your string literals will be recoded to ANSI by the compiler. Moreover, the console font must contain the desired characters and must not be "Terminal". Unfortunately, there is no font covering both umlauts and Chinese characters, even when you install both language packs, so you cannot truly display all character shapes at once.
I solved the problem in the following way:
Lucida Console doesn't seem to support umlauts, so changing the console font to Consolas, for example, works.
#include <stdio.h>
#include <Windows.h>
int main()
{
SetConsoleOutputCP(CP_UTF8);
// I'm using Visual Studio, so encoding the source file in UTF-8 won't work
const char* message = "a" "\xC3\xA4" "bcdefghijklmno" "\xC3\xB6" "pqrs" "\xC3\x9F" "tu" "\xC3\xBC" "vwxyz";
// Note the capital S in the first argument, when used with wprintf it
// specifies a single-byte or multi-byte character string (at least on
// Visual C, not sure about the C library MinGW is using)
wprintf(L"%S", message);
}
EDIT: fixed stupid typos and the decoding of the string literal, sorry about those.

What does the function setlocale do?

I wrote a function to convert a wstring to a string. If I remove the call setlocale(LC_CTYPE, ""), the program goes wrong. I referred to the documentation on cplusplus.com:
C string containing the name of a C locale. These are system specific,
but at least the two following locales must exist:
"C" Minimal "C" locale
"" Environment's default locale
If the value of this parameter is NULL, the function does not make any
changes to the current locale, but the name of the current locale is
still returned by the function.
My code is below; the source code is from cplusplus.com (I added some Chinese characters):
/* wcstombs example */
#include <stdio.h> /* printf */
#include <stdlib.h> /* wcstombs, wchar_t(C) */
#include <locale.h> /* setlocale */
int main()
{
setlocale(LC_CTYPE, "");
const wchar_t str[] = L"中国、wcstombs example";
char buffer[64];
int ret;
printf ("wchar_t string: %ls \n",str);
ret = wcstombs ( buffer, str, sizeof(buffer) );
if (ret==64)
buffer[63]='\0';
if (ret)
printf ("length:%d,multibyte string: %s \n",ret,buffer);
return 0;
}
If I remove the call setlocale(LC_CTYPE, ""), the program does not run as I expect.
My question is: if I run this on a different machine, will the program behave differently? As the doc says, if the locale is "", the function does not make any changes to the current locale, but the name of the current locale is still returned by the function.
Is that because the current locale may differ from machine to machine?
Here is my C++ version of converting between wstring and string; string-to-wstring does not need setlocale, and the program runs well:
#include <locale.h> /* setlocale */
#include <stdlib.h> /* mbstowcs, wcstombs */
#include <string>
#include <vector>

/*
string converts to wstring
*/
std::wstring s2ws(const std::string& src)
{
std::wstring res = L"";
size_t const wcs_len = mbstowcs(NULL, src.c_str(), 0);
std::vector<wchar_t> buffer(wcs_len + 1);
mbstowcs(&buffer[0], src.c_str(), src.size());
res.assign(buffer.begin(), buffer.end() - 1);
return res;
}
/*
wstring converts to string
*/
std::string ws2s(const std::wstring & src)
{
setlocale(LC_CTYPE, "");
std::string res = "";
size_t const mbs_len = wcstombs(NULL, src.c_str(), 0);
std::vector<char> buffer(mbs_len + 1);
wcstombs(&buffer[0], src.c_str(), buffer.size());
res.assign(buffer.begin(), buffer.end() - 1);
return res;
}
If the second argument to setlocale is NULL, it does nothing apart from returning the current locale. But you're not doing that. You're passing it the empty string "", i.e. a string consisting of nothing but the terminating nul byte. My setlocale man page says:
If locale is an empty string, "", each part of the locale that should be modified is set according to the environment variables. The details are implementation-dependent.
So what this is doing for you is setting the locale to whatever the user has specified or to the system default.
Without calling setlocale at all, the program stays in the default "C" locale, which cannot represent the Chinese characters in your string; that is why your program fails without that call.
Two other man pages for stuff you're using say
The behavior of mbstowcs() depends on the LC_CTYPE category of the current locale.
The behavior of wcstombs() depends on the LC_CTYPE category of the current locale.
Presumably these routines are what is failing if you haven't set the locale at all.
I would guess that you probably don't need to run the setlocale statement on every invocation of these routines, but you do need to make sure it's run at least once before running them.
As far as what happens differently depending on the current locale, I believe that would be how exactly the multibyte string is converted to wide characters and vice versa. I think the man pages for those routines leave it vague because of that difference. Personally, I'd prefer if they gave some examples, such as "if the current locale is C, the multibyte string is ASCII characters." I would guess there's also at least one locale in which it is interpreted as UTF-8, but I don't know enough about the different locales to say exactly which one that is. There's probably also at least one locale where the multibyte string happens to be another two-bytes-per-character encoding, but C and C++ would still treat it as bytes.
Edit: Thinking about this more, given the characters you added to the example code, it might make sense to explicitly state that using a locale that does not support Chinese characters will cause the final printf to report that the length was -1, and this includes the default C locale. In this case, the contents of the buffer are not clearly specified by the standard; my reading indicates that the buffer will probably hold all of the characters up to, but not including, the one that failed to convert. Neither the C++ documentation nor the C documentation states what happens regarding the character that could not be converted. I haven't paid for the official standards, but I do have copies of the last free releases. C++17 defers to C17, and C17 also refrains from commenting on this aspect of the function. For wcsrtombs, it explicitly states that the conversion state is unspecified. However, for wcstombs_s, C17 states:
If the conversion stops without converting a null wide character and dst is not a null pointer, then a null character is stored into the array pointed to by dst immediately following any multibyte characters already stored.
In my own experiments with the code provided by the OP above, it appears that the wcstombs implementation on Fedora 28 simply refrains from making any further changes to the buffer. That suggests to me that, if the exact behavior matters in this situation, it may make sense to use wcstombs_s instead. But at a minimum, you should check whether the returned length is (size_t)-1 and, if it is, report an error rather than assuming the conversion worked.
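A minimal sketch of that check, based on the ws2s helper above (the name ws2s_checked is made up for illustration):
#include <clocale>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

std::string ws2s_checked(const std::wstring& src)
{
    // Pick up the environment's locale once per program run.
    static const char* locale_set = std::setlocale(LC_CTYPE, "");
    (void)locale_set;

    std::size_t mbs_len = std::wcstombs(NULL, src.c_str(), 0);
    if (mbs_len == static_cast<std::size_t>(-1))
        throw std::runtime_error("wstring not representable in the current locale");

    std::vector<char> buffer(mbs_len + 1);
    std::wcstombs(&buffer[0], src.c_str(), buffer.size());
    return std::string(buffer.begin(), buffer.end() - 1);
}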

Safe C++ std::string to TCHAR * conversion?

I am trying to convert a std::string to a TCHAR* for use in CreateFile(). The code I have compiles and works, but Visual Studio 2013 comes up with a compiler warning:
warning C4996: 'std::_Copy_impl': Function call with parameters that may be unsafe - this call relies on the caller to check that the passed values are correct. To disable this warning, use -D_SCL_SECURE_NO_WARNINGS. See documentation on how to use Visual C++ 'Checked Iterators'
I understand why I get the warning, as in my code I use std::copy, but I don't want to define _SCL_SECURE_NO_WARNINGS if at all possible, as they have a point: std::copy is unsafe/insecure. As a result, I'd like to find a way that doesn't trigger this warning.
The code that produces the warning:
std::string filename = fileList->getFullPath(index);
TCHAR *t_filename = new TCHAR[filename.size() + 1];
t_filename[filename.size()] = 0;
std::copy(filename.begin(), filename.end(), t_filename);
audioreader.setFile(t_filename);
audioreader.setFile() calls CreateFile() internally, which is why I need to convert the string.
fileList and audioreader are instances of classes I wrote myself, but I'd rather not change the core implementation of either if at all possible, as it would mean changing a lot of implementation in other areas of my program, whereas this conversion only happens in that piece of code. The method I used to convert there came from a solution I found at http://www.cplusplus.com/forum/general/12245/#msg58523
I've seen something similar in another question (Converting string to tchar in VC++), but I can't quite fathom how to adapt the answer to work with mine, as the size of the string isn't constant. All other ways I've seen involve a straight (TCHAR *) cast (or something equally unsafe), which, as far as I know about the way TCHAR and the other Windows string types are defined, is relatively risky, as TCHAR could be narrow or wide characters depending on the UNICODE definition.
Does anyone know a safe, reliable way to convert a std::string to a TCHAR* for use in functions like CreateFile()?
EDIT to address questions in the comments and answers:
Regarding UNICODE being defined or not: the project in VS2013 is a Win32 console application, with #undef UNICODE at the top of the .cpp file containing main(). (What is the difference between UNICODE and _UNICODE? I assume the underscore in what Amadeus was asking about is significant.)
Not directly related to the question but may add perspective: This program is not going to be used outside the UK, so ANSI vs UNICODE does not matter for this. This is part of a personal project to create an audio server and client. As a result you may see some bits referencing network communication. The aim of this program is to get me using Xaudio and winsock. The conversion issue purely deals with the loading of the file on the server-side so it can open it and start reading chunks to transmit. I'm testing with .wav files found in c:/windows/media
Filename encoding: I read the filenames in at runtime by using FindFirstFileA() and FindNextFileA(). The names are retrieved by looking at cFilename in a WIN32_FIND_DATAA structure. They are stored in a vector<string> (wrapped in a unique_ptr if that matters) but that could be changed. I assume this is what Dan Korn means.
More info about my classes and functions:
The following are spread between AudioReader.h, AudioReader.cpp, FileList.h, FileList.cpp and ClientSession.h. The fragment above is in ClientSession.cpp. Note that in most of my files I declare using namespace std;
shared_ptr<FileList> fileList; //ClientSession.h
AudioReader audioreader; //ClientSession.h
string _storedpath; //FileList.h
unique_ptr<vector<string>> _filenames; //FileList.h
//FileList.cpp
string FileList::getFullPath(int i)
{
string ret = "";
unique_lock<mutex> listLock(listmtx);
if (static_cast<size_t>(i) < _count)
{
ret = _storedpath + _filenames->at(i);
}
else
{
//rather than go out of bounds, return the last element, as returning an error over the network is difficult at present
ret = _storedpath + _filenames->at(_count - 1);
}
return ret;
}
unique_ptr<AudioReader_Impl> audioReaderImpl; //AudioReader.h
//AudioReader.cpp
HRESULT AudioReader::setFile(TCHAR * fileName)
{
return audioReaderImpl->setFile(fileName);
}
HANDLE AudioReader_Impl::fileHandle; //AudioReader.cpp
//AudioReader.cpp
HRESULT AudioReader_Impl::setFile(TCHAR * fileName)
{
fileHandle = CreateFile(fileName, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, 0, NULL);
if (fileHandle == INVALID_HANDLE_VALUE)
{
return HRESULT_FROM_WIN32(GetLastError());
}
if (SetFilePointer(fileHandle, 0, NULL, FILE_BEGIN) == INVALID_SET_FILE_POINTER)
{
return HRESULT_FROM_WIN32(GetLastError());
}
return S_OK;
}
If you do not need to support the string containing UTF-8 (or another multi-byte encoding) then simply use the ANSI version of Windows API:
handle = CreateFileA( filename.c_str(), .......)
You might need to rejig your code for this as you have the CreateFile buried in a function that expects TCHAR. That's not advised these days; it's a pain to splatter T versions of everything all over your code and it has flow-on effects (such as std::tstring that someone suggested - ugh!)
There hasn't been any need to support dual compilation from the same source code since about 1998. Windows API has to support both versions for backward compatibility but your own code does not have to.
If you do want to support the string containing UTF-8 (and this is a better idea than using UTF-16 everywhere) then you will need to convert it to a UTF-16 string in order to call the Windows API.
The usual way to do this is via the Windows API function MultiByteToWideChar which is a bit awkward to use correctly, but you could wrap it up in a function:
std::wstring make_wstring( std::string const &s );
that invokes MultiByteToWideChar to return a UTF-16 string that you can then pass to WinAPI functions by using its .c_str() function.
See this codereview thread for a possible implementation of such a function (although note the discussion in the answers).
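A minimal sketch of such a wrapper (an illustration only, assuming the input string is UTF-8 and with deliberately simple error handling):
#include <string>
#include <stdexcept>
#include <windows.h>

// Hypothetical helper: convert a UTF-8 std::string to a UTF-16 std::wstring
// using MultiByteToWideChar.
std::wstring make_wstring(std::string const &s)
{
    if (s.empty())
        return std::wstring();

    int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                  s.c_str(), static_cast<int>(s.size()),
                                  NULL, 0);
    if (len == 0)
        throw std::runtime_error("MultiByteToWideChar failed");

    std::wstring result(len, L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        s.c_str(), static_cast<int>(s.size()),
                        &result[0], len);
    return result;
}
Usage would then look like CreateFileW(make_wstring(filename).c_str(), ...), keeping TCHAR out of your own code.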
The root of your problem is that you are mixing TCHARs and non-TCHARs. Even if you get it to work on your machine, unless you do it precisely right, it will fail when non-ASCII characters are used in the path.
If you can use a std::basic_string<TCHAR> typedef (often called tstring) instead of a regular string, then you won't have to worry about format conversions or codepage-versus-Unicode issues.
If not, you can use conversion functions like MultiByteToWideChar but make sure you understand the encoding used in the source string or it will just make things worse.
Try this instead:
std::string filename = fileList->getFullPath(index);
#ifndef UNICODE
audioreader.setFile(filename.c_str());
#else
std::wstring w_filename;
int len = MultiByteToWideChar(CP_ACP, 0, filename.c_str(), filename.length(), NULL, 0);
if (len > 0)
{
w_filename.resize(len);
MultiByteToWideChar(CP_ACP, 0, filename.c_str(), filename.length(), &w_filename[0], len);
}
audioreader.setFile(w_filename.c_str());
#endif
Alternatively:
std::string filename = fileList->getFullPath(index);
#ifndef UNICODE
audioreader.setFile(filename.c_str());
#else
std::wstring_convert<std::codecvt<wchar_t, char, std::mbstate_t>> conv;
std::wstring w_filename = conv.from_bytes(filename);
audioreader.setFile(w_filename.c_str());
#endif
