Reading UTF-8 characters from console - c++

I'm trying to read UTF-8 encoded polish characters from console for my c++ application.
I'm sure that console uses this code page (checked in properties).
What I have already tried:
Using cin - instead of "zażółć" I read "za\0\0\0\0"
Using wcin - instead of "zażółć" - same result as with cin
Using scanf - instead of 'zażółć\0' I read 'za\0\0\0\0\0'
Using wscanf - same result as with scanf
Using getchar to read characters one by one - same result as with scanf
On the beginning of the main function I have following lines:
setlocale(LC_ALL, "PL_pl.UTF-8");
SetConsoleOutputCP(CP_UTF8);
SetConsoleCP(CP_UTF8);
I would be really greatful for help.

Although you’ve already accepted an answer, here’s a more portable version, which sticks closer to the standard library. Unfortunately, this is one area where I’ve found that a lot of widely-used implementations do not support things that are supposedly in the standard. For example, there is supposed to be a standard way to print multi-byte strings (which theoretically could be something unusual like shift-JIS, but in practice are UTF-8 on every modern OS), but it does not actually work portably. Microsoft’s runtime library is especially poor in this regard, but I’ve also found bugs in libc++.
/* Boilerplate feature-test macros: */
#if _WIN32 || _WIN64
# define _WIN32_WINNT 0x0A00 // _WIN32_WINNT_WIN10
# define NTDDI_VERSION 0x0A000002 // NTDDI_WIN10_RS1
# include <sdkddkver.h>
#else
# define _XOPEN_SOURCE 700
# define _POSIX_C_SOURCE 200809L
#endif
#include <iostream>
#include <locale>
#include <locale.h>
#include <stdlib.h>
#include <string>
#ifndef MS_STDLIB_BUGS // Allow overriding the autodetection.
/* The Microsoft C and C++ runtime libraries that ship with Visual Studio, as
* of 2017, have a bug that neither stdio, iostreams or wide iostreams can
* handle Unicode input or output. Windows needs some non-standard magic to
* work around that. This includes programs compiled with MinGW and Clang
* for the win32 and win64 targets.
*
* NOTE TO USERS OF TDM-GCC: This code is known to break on tdm-gcc 4.9.2. As
* a workaround, "-D MS_STDLIB_BUGS=0" will at least get it to compile, but
* Unicode output will still not work.
*/
# if ( _MSC_VER || __MINGW32__ || __MSVCRT__ )
/* This code is being compiled either on MS Visual C++, or MinGW, or
* clang++ in compatibility mode for either, or is being linked to the
* msvcrt (Microsoft Visual C RunTime) library.
*/
# define MS_STDLIB_BUGS 1
# else
# define MS_STDLIB_BUGS 0
# endif
#endif
#if MS_STDLIB_BUGS
# include <io.h>
# include <fcntl.h>
#endif
using std::endl;
using std::istream;
using std::wcin;
using std::wcout;
void init_locale(void)
// Does magic so that wcout can work.
{
#if MS_STDLIB_BUGS
// Windows needs a little non-standard magic.
constexpr char cp_utf16le[] = ".1200";
setlocale( LC_ALL, cp_utf16le );
_setmode( _fileno(stdout), _O_WTEXT );
_setmode( _fileno(stdin), _O_WTEXT );
#else
// The correct locale name may vary by OS, e.g., "en_US.utf8".
constexpr char locale_name[] = "";
setlocale( LC_ALL, locale_name );
std::locale::global(std::locale(locale_name));
wcout.imbue(std::locale());
wcin.imbue(std::locale());
#endif
}
int main(void)
{
init_locale();
static constexpr size_t bufsize = 1024;
std::wstring input;
input.reserve(bufsize);
while ( wcin >> input )
wcout << input << endl;
return EXIT_SUCCESS;
}
This reads in wide-character input from the console regardless of its initial locale or code page. If what you meant instead was that the input will be bytes in the UTF-8 encoding (such as from a redirected file in UTF-8 encoding), not console input, the standard way to accomplish this is supposed to be the conversion facet from UTF-8 to wchar_t in <codecvt> and <locale>, but in practice Windows doesn’t support Unicode locales, so you have to read the bytes in and then convert them manually. A more standard way to do that is mbstowcs(). I have some old code to do the conversion for STL iterators, but there are also conversion functions in the standard library. You might need to do this anyway, if for example you need to save or transmit in UTF-8.
There are some who will recommend you store all strings in UTF-8 internally even when using an API like Windows based on some form of UTF-16, converting to another encoding only when you make API calls. I strongly advise you to use UTF-8 externally whenever you possibly can, but I don’t go quite that far. Note, however, that storing strings as UTF-8 will save you a lot of memory, especially on systems where wchar_t is UCS-32. You would have a better idea than I how many bytes this would typically save you for Polish text.

Here is the trick I use for UTF-8 support. The result is multibyte string which could be then used elsewhere:
#include <cstdio>
#include <windows.h>
#define MAX_INPUT_LENGTH 255
int main()
{
SetConsoleOutputCP(CP_UTF8);
SetConsoleCP(CP_UTF8);
wchar_t wstr[MAX_INPUT_LENGTH];
char mb_str[MAX_INPUT_LENGTH * 3 + 1];
unsigned long read;
void *con = GetStdHandle(STD_INPUT_HANDLE);
ReadConsole(con, wstr, MAX_INPUT_LENGTH, &read, NULL);
int size = WideCharToMultiByte(CP_UTF8, 0, wstr, read, mb_str, sizeof(mb_str), NULL, NULL);
mb_str[size] = 0;
std::printf("ENTERED: %s\n", mb_str);
return 0;
}
Should look like this:
P.S. Big thanks to Remy Lebeau for pointing out some flaws!

Related

How to port Windows C++ that handles unicode with tchar.h to iOS app

I have some c++ code that I need to integrate into an iOS app. The windows c++ code handles unicode using tchar.h. I have made the following defines for iOS:
#include <wchar.h>
#define _T(x) x
#define TCHAR char
#define _tremove unlink
#define _stprintf sprintf
#define _sntprintf vsnprintf
#define _tcsncpy wcsncpy
#define _tcscpy wcscpy
#define _tcscmp wcscmp
#define _tcsrchr wcsrchr
#define _tfopen fopen
When trying to build the app many of these are either missing (ex. wcscpy) or have the wrong arguments. The coder responsible for the c++ code said I should use char instead of wchar so I defined TCHAR as char. Does anyone have a clue as to how this should be done?
The purpose of TCHAR (and FYI, it is _TCHAR in the C runtime, TCHAR is for the Win32 API) is to allow code to switch between either char or wchar_t APIs at compile time, but your defines are mixing them together. The wcs functions are for wchar_t, so you need to change those defines to the char counterparts, to match your other char-based defines:
#define _tcsncpy strncpy
#define _tcscpy strcpy
#define _tcscmp strcmp
#define _tcsrchr strrchr
Also, you are mapping _sntprintf to the wrong C function. It needs to be mapped to snprintf() instead of vsnprintf():
#define _sntprintf snprintf
snprintf() and vsnprintf() are declared very differently:
int snprintf ( char * s, size_t n, const char * format, ... );
int vsnprintf (char * s, size_t n, const char * format, va_list arg );
Which is likely why you are getting "wrong arguments" errors.

which is a good practice to use access function

I have the following code that I want to work also on Linux with GCC 4.8
This is working with VS 2013
if ( _access( trigger->c_str(), 0 ) != -1 )
{
...
}
I know that on Linux I can use function: access from "unistd.h"
Is there a way to avoid having something like the following ( a more elegant solution ) ?
#ifdef __linux__
#include <unistd.h>
#endif
#ifdef __linux__
if ( access( trigger->c_str(), 0 ) != -1 )
{
...
}
#else
if ( _access( trigger->c_str(), 0 ) != -1 )
{
...
}
#endif
A solution that has no duplication, nor relies on a macro definition (besides the predefined one for platform detection) but has slightly more boilerplate than Aracthor's solution:
#ifdef _WIN32
inline int access(const char *pathname, int mode) {
return _access(pathname, mode);
}
#else
#include <unistd.h>
#endif
I prefer to detect windows, and use posix as fall back, because windows tends to be the exception more often than linux.
Another clean solution would be to define _CRT_NONSTDC_NO_WARNINGS and keep using the POSIX standard access in windows, without warnings about deprecation. As a bonus, this also disables warnings for using the standard strcpy instead of strcpy_s and similar. The latter is also standard (in C11), but optional and hardly any other C library implements them (and also, not all of the _s family functions in msvc comply to C11).
There is another way, header-only solution.
#ifdef __linux__
#include <unistd.h>
#else
#define access _access
#endif
if ( access( trigger->c_str(), 0 ) != -1 )
{
...
}
It would include the right file on Linux systems and replace access with _access on other systems.

How to get %AppData% path as std::string?

I've read that one can use SHGetSpecialFolderPath(); to get the AppData path. However, it returns a TCHAR array. I need to have an std::string.
How can it be converted to an std::string?
Update
I've read that it is possible to use getenv("APPDATA"), but that it is not available in Windows XP. I want to support Windows XP - Windows 10.
The T type means that SHGetSpecialFolderPath is a pair of functions:
SHGetSpecialFolderPathA for Windows ANSI encoded char based text, and
SHGetSpecialFolderPathW for UTF-16 encoded wchar_t based text, Windows' “Unicode”.
The ANSI variant is just a wrapper for the Unicode variant, and it can not logically produce a correct path in all cases.
But this is what you need to use for char based data.
An alternative is to use the wide variant of the function, and use whatever machinery that you're comfortable with to convert the wide text result to a byte-oriented char based encoding of your choice, e.g. UTF-8.
Note that UTF-8 strings can't be used directly to open files etc. via the Windows API, so this approach involves even more conversion just to use the string.
However, I recommend switching over to wide text, in Windows.
For this, define the macro symbol UNICODE before including <windows.h>.
That's also the default in a Visual Studio project.
https://msdn.microsoft.com/en-gb/library/windows/desktop/dd374131%28v=vs.85%29.aspx
#ifdef UNICODE
typedef wchar_t TCHAR;
#else
typedef unsigned char TCHAR;
#endif
Basically you can can convert this array to std::wstring. Converting to std::string is straightforward with std::wstring_convert.
http://en.cppreference.com/w/cpp/locale/wstring_convert
You should use SHGetSpecialFolderPathA() to have the function deal with ANSI characters explicitly.
Then, just convert the array of char to std::string as usual.
/* to have MinGW declare SHGetSpecialFolderPathA() */
#if !defined(_WIN32_IE) || _WIN32_IE < 0x0400
#undef _WIN32_IE
#define _WIN32_IE 0x0400
#endif
#include <shlobj.h>
#include <string>
std::string getPath(int csidl) {
char out[MAX_PATH];
if (SHGetSpecialFolderPathA(NULL, out, csidl, 0)) {
return out;
} else {
return "";
}
}
Typedef String as either std::string or std::wstring depending on your compilation configuration. The following code might be useful:
#ifndef UNICODE
typedef std::string String;
#else
typedef std::wstring String;
#endif

Xerces-c and cross-platform string literals

I'm porting a code-base that uses Xerces-c for XML processing from Windows/VC++ to Linux/G++.
On Windows, Xerces-c uses wchar_t as the character type XmlCh. This has allowed people to use std::wstring and string literals of L"" syntax.
On Linux/G++, wchar_t is 32-bit and Xerces-c uses unsigned short int (16-bit) as the character type XmlCh.
I've started out along this track:
#ifdef _MSC_VER
using u16char_t = wchar_t;
using u16string_t = std::wstring;
#elif defined __linux
using u16char_t = char16_t;
using u16string_t = std::u16string;
#endif
Unfortunately, char16_t and unsigned short int are not equivalent and their pointers are not implicitly convertible. So passing u"Hello, world." to Xerces functions still results in invalid conversion errors.
It's starting to look like I'm going to have to explicitly cast every string I pass to Xerces functions. But before I do, I wanted to ask if anyone knows a saner way to programme cross-platform Xerces-c code.
The answer is that no, no-one has a good idea on how to do this. For anyone else who finds this question, this is what I came up with:
#ifdef _MSC_VER
#define U16S(x) L##x
#define U16XS(x) L##x
#define XS(x) x
#define US(x) x
#elif defined __linux
#define U16S(x) u##x
#define U16XS(x) reinterpret_cast<const unsigned short *>(u##x)
inline unsigned short *XS(char16_t* x) {
return reinterpret_cast<unsigned short *>(x);
}
inline const unsigned short *XS(const char16_t* x) {
return reinterpret_cast<const unsigned short *>(x);
}
inline char16_t* US(unsigned short *x) {
return reinterpret_cast<char16_t *>(x);
}
inline const char16_t* US(const unsigned short *x) {
return reinterpret_cast<const char16_t*>(x);
}
#include "char16_t_facets.hpp"
#endif
namespace SafeStrings {
#if defined _MSC_VER
using u16char_t = wchar_t;
using u16string_t = std::wstring;
using u16sstream_t = std::wstringstream;
using u16ostream_t = std::wostream;
using u16istream_t = std::wistream;
using u16ofstream_t = std::wofstream;
using u16ifstream_t = std::wifstream;
using filename_t = std::wstring;
#elif defined __linux
using u16char_t = char16_t;
using u16string_t = std::basic_string<char16_t>;
using u16sstream_t = std::basic_stringstream<char16_t>;
using u16ostream_t = std::basic_ostream<char16_t>;
using u16istream_t = std::basic_istream<char16_t>;
using u16ofstream_t = std::basic_ofstream<char16_t>;
using u16ifstream_t = std::basic_ifstream<char16_t>;
using filename_t = std::string;
#endif
char16_t_facets.hpp has definitions of the template specialisations std::ctype<char16_t>, std::numpunct<char16_t>, std::codecvt<char16_t, char, std::mbstate_t>. It's necessary to add these to the global locale, along with std::num_get<char16_t> and std::num_put<char16_t> (but it's not necessary to provide specialisations for these). The code for codecvt is the only bit that's difficult, and a reasonable template can be found in the GCC 5.0 libraries (if you use GCC 5, you don't need to provide the codecvt specialisation as it's already in the library).
Once you've done all of that, the char16_t streams will work correctly.
Then, every time you define a wide string, instead of L"string", write U16S("string"). Every time you pass a string to Xerces, write XS(string.c_str()) or U16XS("string") for literals. Every time you get a string back from Xerces, convert it back as u16string_t(US(call_xerces_function())).
Note that it is also possible to recompile Xerces-C with the character type set to char16_t. This removes a lot of the effort required above. BUT you won't be able to use any other library on the system that in turn depends on Xerces-C. Linking to any such library will cause link errors (because changing the character type changes many of the Xerces function signatures).

How do I convert from std::wstring _TCHAR []?

I'm using a library and sends me std::wstring from one of its functions, and another library that requires _TCHAR [] to be sent to it. How can I convert it?
Assuming you're using Unicode build, std::wstring.c_str() is what you need. Note that c_str() guarantees that the string it returns is null-terminated.
e.g.
void func(const wchar_t str[])
{
}
std::wstring src;
func(src.c_str());
If you're using non-Unicode build, you'll need to convert the Unicode string to non Unicode string via WideCharToMultiByte.
As #Zach Saw said, if you build only for Unicode you can get away with std::wstring.c_str(), but conteptually it would be better to define a tstring (a typedef for std::basic_string<TCHAR>) so you can safely use this kind of string flawlessly with all the Windows and library functions which expect TCHARs1.
For additional fun you should define also all the other string-related C++ facilities for TCHARs, and create conversion functions std::string/std::wstring <=> tstring.
Fortunately, this work has already been done; see here and here.
Actually no compiled library function can really expect a TCHAR *, since TCHARs are resolved as chars or wchar_ts at compile time, but you got the idea.
Use the ATL and MFC String Conversion Macros. This works regardless of whether you are compiling in _UNICODE or ANSI mode.
You can use these macros even if you aren’t using MFC. Just include the two ATL headers shown in this example:
#include <string>
#include <Windows.h>
#include <AtlBase.h>
#include <AtlConv.h>
int main()
{
std::wstring myString = L"Hello, World!";
// Here is an ATL string conversion macro:
CW2T pszT(myString.c_str());
// pszT is now an object which can be used anywhere a `const TCHAR*`
// is required. For example:
::MessageBox(NULL, pszT, _T("Test MessageBox"), MB_OK);
return 0;
}