Xerces-C and cross-platform string literals - C++

I'm porting a code-base that uses Xerces-c for XML processing from Windows/VC++ to Linux/G++.
On Windows, Xerces-C uses wchar_t as its character type XMLCh. This has allowed people to use std::wstring and string literals with the L"" syntax.
On Linux/G++, wchar_t is 32-bit, so Xerces-C uses unsigned short int (16-bit) as the character type XMLCh.
I've started out along this track:
#ifdef _MSC_VER
using u16char_t = wchar_t;
using u16string_t = std::wstring;
#elif defined __linux
using u16char_t = char16_t;
using u16string_t = std::u16string;
#endif
Unfortunately, char16_t and unsigned short int are distinct types and pointers to them are not implicitly convertible. So passing u"Hello, world." to Xerces functions still results in invalid-conversion errors.
It's starting to look like I'm going to have to explicitly cast every string I pass to Xerces functions. But before I do, I wanted to ask whether anyone knows a saner way to write cross-platform Xerces-C code.

The answer is that no, no-one has a good idea on how to do this. For anyone else who finds this question, this is what I came up with:
#ifdef _MSC_VER
#define U16S(x) L##x
#define U16XS(x) L##x
#define XS(x) x
#define US(x) x
#elif defined __linux
#define U16S(x) u##x
#define U16XS(x) reinterpret_cast<const unsigned short *>(u##x)
inline unsigned short *XS(char16_t *x) {
    return reinterpret_cast<unsigned short *>(x);
}
inline const unsigned short *XS(const char16_t *x) {
    return reinterpret_cast<const unsigned short *>(x);
}
inline char16_t *US(unsigned short *x) {
    return reinterpret_cast<char16_t *>(x);
}
inline const char16_t *US(const unsigned short *x) {
    return reinterpret_cast<const char16_t *>(x);
}
#include "char16_t_facets.hpp"
#endif
namespace SafeStrings {
#if defined _MSC_VER
using u16char_t = wchar_t;
using u16string_t = std::wstring;
using u16sstream_t = std::wstringstream;
using u16ostream_t = std::wostream;
using u16istream_t = std::wistream;
using u16ofstream_t = std::wofstream;
using u16ifstream_t = std::wifstream;
using filename_t = std::wstring;
#elif defined __linux
using u16char_t = char16_t;
using u16string_t = std::basic_string<char16_t>;
using u16sstream_t = std::basic_stringstream<char16_t>;
using u16ostream_t = std::basic_ostream<char16_t>;
using u16istream_t = std::basic_istream<char16_t>;
using u16ofstream_t = std::basic_ofstream<char16_t>;
using u16ifstream_t = std::basic_ifstream<char16_t>;
using filename_t = std::string;
#endif
} // namespace SafeStrings
char16_t_facets.hpp has definitions of the template specialisations std::ctype<char16_t>, std::numpunct<char16_t>, std::codecvt<char16_t, char, std::mbstate_t>. It's necessary to add these to the global locale, along with std::num_get<char16_t> and std::num_put<char16_t> (but it's not necessary to provide specialisations for these). The code for codecvt is the only bit that's difficult, and a reasonable template can be found in the GCC 5.0 libraries (if you use GCC 5, you don't need to provide the codecvt specialisation as it's already in the library).
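For concreteness, here is a minimal sketch of how those facets might be installed into the global locale. It assumes the specialisations come from char16_t_facets.hpp and are default-constructible; the function name is illustrative, not part of the original code:
#include <locale>
#include "char16_t_facets.hpp"   // assumed to declare the char16_t specialisations

inline void install_char16_facets()
{
    // Each std::locale(loc, facet) constructor layers one more facet on top;
    // the locale takes ownership of the new-ed facets.
    std::locale loc(std::locale(), new std::ctype<char16_t>);
    loc = std::locale(loc, new std::numpunct<char16_t>);
    loc = std::locale(loc, new std::codecvt<char16_t, char, std::mbstate_t>);
    loc = std::locale(loc, new std::num_get<char16_t>);   // primary template suffices
    loc = std::locale(loc, new std::num_put<char16_t>);   // primary template suffices
    std::locale::global(loc);   // do this once, before creating any char16_t streams
}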
Once you've done all of that, the char16_t streams will work correctly.
Then, every time you define a wide string, instead of L"string", write U16S("string"). Every time you pass a string to Xerces, write XS(string.c_str()) or U16XS("string") for literals. Every time you get a string back from Xerces, convert it back as u16string_t(US(call_xerces_function())).
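As a hypothetical usage sketch (the helper function and element name are mine, not taken from the original code base), a call into the Xerces DOM then looks like this:
#include <xercesc/dom/DOM.hpp>

using namespace SafeStrings;

u16string_t addTitle(xercesc::DOMDocument& doc, const u16string_t& title)
{
    xercesc::DOMElement* elem = doc.createElement(U16XS("title"));  // literal via U16XS()
    elem->setTextContent(XS(title.c_str()));                        // runtime string via XS()
    doc.getDocumentElement()->appendChild(elem);
    return u16string_t(US(elem->getTagName()));                     // Xerces string back via US()
}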
Note that it is also possible to recompile Xerces-C with the character type set to char16_t. This removes a lot of the effort required above. BUT you won't be able to use any other library on the system that in turn depends on Xerces-C. Linking to any such library will cause link errors (because changing the character type changes many of the Xerces function signatures).

Related

Reading UTF-8 characters from console

I'm trying to read UTF-8 encoded polish characters from console for my c++ application.
I'm sure that console uses this code page (checked in properties).
What I have already tried:
Using cin - instead of "zażółć" I read "za\0\0\0\0"
Using wcin - instead of "zażółć" - same result as with cin
Using scanf - instead of 'zażółć\0' I read 'za\0\0\0\0\0'
Using wscanf - same result as with scanf
Using getchar to read characters one by one - same result as with scanf
On the beginning of the main function I have following lines:
setlocale(LC_ALL, "PL_pl.UTF-8");
SetConsoleOutputCP(CP_UTF8);
SetConsoleCP(CP_UTF8);
I would be really grateful for any help.
Although you’ve already accepted an answer, here’s a more portable version, which sticks closer to the standard library. Unfortunately, this is one area where I’ve found that a lot of widely-used implementations do not support things that are supposedly in the standard. For example, there is supposed to be a standard way to print multi-byte strings (which theoretically could be something unusual like shift-JIS, but in practice are UTF-8 on every modern OS), but it does not actually work portably. Microsoft’s runtime library is especially poor in this regard, but I’ve also found bugs in libc++.
/* Boilerplate feature-test macros: */
#if _WIN32 || _WIN64
# define _WIN32_WINNT 0x0A00 // _WIN32_WINNT_WIN10
# define NTDDI_VERSION 0x0A000002 // NTDDI_WIN10_RS1
# include <sdkddkver.h>
#else
# define _XOPEN_SOURCE 700
# define _POSIX_C_SOURCE 200809L
#endif
#include <iostream>
#include <locale>
#include <locale.h>
#include <stdlib.h>
#include <string>
#ifndef MS_STDLIB_BUGS // Allow overriding the autodetection.
/* The Microsoft C and C++ runtime libraries that ship with Visual Studio, as
 * of 2017, have a bug such that neither stdio, iostreams nor wide iostreams can
* handle Unicode input or output. Windows needs some non-standard magic to
* work around that. This includes programs compiled with MinGW and Clang
* for the win32 and win64 targets.
*
* NOTE TO USERS OF TDM-GCC: This code is known to break on tdm-gcc 4.9.2. As
* a workaround, "-D MS_STDLIB_BUGS=0" will at least get it to compile, but
* Unicode output will still not work.
*/
# if ( _MSC_VER || __MINGW32__ || __MSVCRT__ )
/* This code is being compiled either on MS Visual C++, or MinGW, or
* clang++ in compatibility mode for either, or is being linked to the
* msvcrt (Microsoft Visual C RunTime) library.
*/
# define MS_STDLIB_BUGS 1
# else
# define MS_STDLIB_BUGS 0
# endif
#endif
#if MS_STDLIB_BUGS
# include <io.h>
# include <fcntl.h>
#endif
using std::endl;
using std::istream;
using std::wcin;
using std::wcout;
void init_locale(void)
// Does magic so that wcout can work.
{
#if MS_STDLIB_BUGS
    // Windows needs a little non-standard magic.
    constexpr char cp_utf16le[] = ".1200";
    setlocale( LC_ALL, cp_utf16le );
    _setmode( _fileno(stdout), _O_WTEXT );
    _setmode( _fileno(stdin), _O_WTEXT );
#else
    // The correct locale name may vary by OS, e.g., "en_US.utf8".
    constexpr char locale_name[] = "";
    setlocale( LC_ALL, locale_name );
    std::locale::global(std::locale(locale_name));
    wcout.imbue(std::locale());
    wcin.imbue(std::locale());
#endif
}
int main(void)
{
    init_locale();
    static constexpr size_t bufsize = 1024;
    std::wstring input;
    input.reserve(bufsize);
    while ( wcin >> input )
        wcout << input << endl;
    return EXIT_SUCCESS;
}
This reads in wide-character input from the console regardless of its initial locale or code page. If what you meant instead was that the input will be bytes in the UTF-8 encoding (such as from a redirected file in UTF-8 encoding), not console input, the standard way to accomplish this is supposed to be the conversion facet from UTF-8 to wchar_t in <codecvt> and <locale>, but in practice Windows doesn’t support Unicode locales, so you have to read the bytes in and then convert them manually. A more standard way to do that is mbstowcs(). I have some old code to do the conversion for STL iterators, but there are also conversion functions in the standard library. You might need to do this anyway, if for example you need to save or transmit in UTF-8.
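As a hedged sketch of that manual conversion (the function names are mine; <codecvt> is deprecated in C++17 but still widely available, and the mbstowcs() variant only works if the current locale really is UTF-8):
#include <codecvt>
#include <cstdlib>
#include <locale>
#include <stdexcept>
#include <string>

// Conversion via the <codecvt> facet. Note: codecvt_utf8<wchar_t> treats
// wchar_t as UCS-2/UCS-4; on Windows, codecvt_utf8_utf16<wchar_t> would be
// needed for characters outside the BMP.
std::wstring from_utf8(const std::string& in)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> conv;
    return conv.from_bytes(in);   // throws std::range_error on malformed UTF-8
}

// Conversion via mbstowcs(), relying on the current C locale being UTF-8.
std::wstring from_utf8_mbs(const std::string& in)
{
    std::wstring out(in.size(), L'\0');   // UTF-8 never yields more code points than bytes
    std::size_t n = std::mbstowcs(&out[0], in.c_str(), out.size());
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence");
    out.resize(n);
    return out;
}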
There are some who will recommend you store all strings in UTF-8 internally, even when using an API like Windows' that is based on some form of UTF-16, converting only when you make API calls. I strongly advise you to use UTF-8 externally whenever you possibly can, but I don't go quite that far. Note, however, that storing strings as UTF-8 will save you a lot of memory, especially on systems where wchar_t is UTF-32. You would have a better idea than I how many bytes this would typically save you for Polish text.
Here is the trick I use for UTF-8 support. The result is a multibyte string which can then be used elsewhere:
#include <cstdio>
#include <windows.h>
#define MAX_INPUT_LENGTH 255
int main()
{
    SetConsoleOutputCP(CP_UTF8);
    SetConsoleCP(CP_UTF8);
    wchar_t wstr[MAX_INPUT_LENGTH];
    char mb_str[MAX_INPUT_LENGTH * 3 + 1]; // worst case: 3 UTF-8 bytes per UTF-16 unit
    DWORD read;
    // Read the input as UTF-16 straight from the console...
    HANDLE con = GetStdHandle(STD_INPUT_HANDLE);
    ReadConsoleW(con, wstr, MAX_INPUT_LENGTH, &read, NULL);
    // ...then transcode it to a UTF-8 multibyte string.
    int size = WideCharToMultiByte(CP_UTF8, 0, wstr, read, mb_str, sizeof(mb_str), NULL, NULL);
    mb_str[size] = 0;
    std::printf("ENTERED: %s\n", mb_str);
    return 0;
}
P.S. Big thanks to Remy Lebeau for pointing out some flaws!

How to get %AppData% path as std::string?

I've read that one can use SHGetSpecialFolderPath(); to get the AppData path. However, it returns a TCHAR array. I need to have an std::string.
How can it be converted to an std::string?
Update
I've read that it is possible to use getenv("APPDATA"), but that it is not available in Windows XP. I want to support Windows XP - Windows 10.
The T in the name means that SHGetSpecialFolderPath is really a pair of functions:
SHGetSpecialFolderPathA for Windows ANSI encoded char based text, and
SHGetSpecialFolderPathW for UTF-16 encoded wchar_t based text, Windows' “Unicode”.
The ANSI variant is just a wrapper for the Unicode variant, and it cannot logically produce a correct path in all cases.
But this is what you need to use for char based data.
An alternative is to use the wide variant of the function, and use whatever machinery that you're comfortable with to convert the wide text result to a byte-oriented char based encoding of your choice, e.g. UTF-8.
Note that UTF-8 strings can't be used directly to open files etc. via the Windows API, so this approach involves even more conversion just to use the string.
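A sketch of that wide-call-plus-conversion approach might look like this, assuming you convert to UTF-8 with WideCharToMultiByte (the function name is illustrative):
#include <windows.h>
#include <shlobj.h>
#include <string>

std::string getAppDataUtf8()
{
    wchar_t wide[MAX_PATH];
    if (!SHGetSpecialFolderPathW(NULL, wide, CSIDL_APPDATA, FALSE))
        return "";
    // First call asks for the required buffer size (including the terminator),
    // second call performs the conversion.
    int len = WideCharToMultiByte(CP_UTF8, 0, wide, -1, NULL, 0, NULL, NULL);
    if (len <= 0)
        return "";
    std::string utf8(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide, -1, &utf8[0], len, NULL, NULL);
    utf8.resize(len - 1);   // drop the embedded '\0'
    return utf8;
}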
However, I recommend switching over to wide text, in Windows.
For this, define the macro symbol UNICODE before including <windows.h>.
That's also the default in a Visual Studio project.
https://msdn.microsoft.com/en-gb/library/windows/desktop/dd374131%28v=vs.85%29.aspx
#ifdef UNICODE
typedef wchar_t TCHAR;
#else
typedef char TCHAR;
#endif
Basically you can convert this array to a std::wstring. Converting that to std::string is straightforward with std::wstring_convert.
http://en.cppreference.com/w/cpp/locale/wstring_convert
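For instance, assuming a Unicode build (so the TCHAR array is wchar_t based), the conversion could be sketched like this; note that std::wstring_convert is deprecated in C++17 but still available:
#include <codecvt>
#include <locale>
#include <string>

// Convert a UTF-16 wchar_t string (Windows) to a UTF-8 std::string.
std::string toUtf8(const std::wstring& wide)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.to_bytes(wide);
}

// Usage with the buffer filled by SHGetSpecialFolderPath:
// TCHAR path[MAX_PATH];
// SHGetSpecialFolderPath(NULL, path, CSIDL_APPDATA, FALSE);
// std::string s = toUtf8(path);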
You should use SHGetSpecialFolderPathA() to have the function deal with ANSI characters explicitly.
Then, just convert the array of char to std::string as usual.
/* to have MinGW declare SHGetSpecialFolderPathA() */
#if !defined(_WIN32_IE) || _WIN32_IE < 0x0400
#undef _WIN32_IE
#define _WIN32_IE 0x0400
#endif
#include <shlobj.h>
#include <string>
std::string getPath(int csidl) {
    char out[MAX_PATH];
    if (SHGetSpecialFolderPathA(NULL, out, csidl, 0)) {
        return out;
    } else {
        return "";
    }
}
Typedef String as either std::string or std::wstring depending on your compilation configuration. The following code might be useful:
#ifndef UNICODE
typedef std::string String;
#else
typedef std::wstring String;
#endif

#define equivalent in C++

g++ 4.7.2
Hello,
I am coming from C89 and now I am writing C++ using the g++ compiler.
Normally I do things like this:
#define ARR_SIZE 64
#define DEVICE "DEVICE_64"
What is the equivalent of doing this in C++?
Many thanks for any suggestions,
#define exists in C++, so you can write the same code. But for constant quantities like this, it is better to use the const keyword:
const int ARR_SIZE = 64;
const std::string DEVICE("DEVICE_64");
You can use const in place of #define
const int ARR_SIZE = 64;
const char DEVICE[] = "DEVICE_64";
You can define constants using the const keyword:
const int ARR_SIZE = 64;
const char DEVICE[] = "DEVICE_64";
It’s even better to use an anonymous namespace for that (the names are then restricted to the current file):
namespace {
int const ARR_SIZE = 64;
/* ... */
}
#define is fine!
Apart from stricter type checking, most C code compiles without change under a C++ compiler, so #define is still valid in C++.
You might want to take a look at other Stack Overflow entries such as:
Should I use #define, enum or const?
What issues can I expect compiling C code with a C++ compiler?

How do I convert from std::wstring to _TCHAR[]?

I'm using a library that sends me a std::wstring from one of its functions, and another library that requires a _TCHAR[] to be passed to it. How can I convert between them?
Assuming you're using a Unicode build, std::wstring::c_str() is what you need. Note that c_str() guarantees that the string it returns is null-terminated.
e.g.
void func(const wchar_t str[])
{
}
std::wstring src;
func(src.c_str());
If you're using a non-Unicode build, you'll need to convert the Unicode string to a non-Unicode string via WideCharToMultiByte.
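A hedged sketch of that conversion, targeting the active ANSI code page since _TCHAR is char in a non-Unicode build (the function name is mine):
#include <windows.h>
#include <string>

std::string toAnsi(const std::wstring& wide)
{
    // Ask for the required size first (the -1 length includes the terminator).
    int len = WideCharToMultiByte(CP_ACP, 0, wide.c_str(), -1, NULL, 0, NULL, NULL);
    if (len <= 0)
        return "";
    std::string narrow(len, '\0');
    WideCharToMultiByte(CP_ACP, 0, wide.c_str(), -1, &narrow[0], len, NULL, NULL);
    narrow.resize(len - 1);   // drop the embedded terminator
    return narrow;            // narrow.c_str() can then be passed where a _TCHAR* is expected
}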
As #Zach Saw said, if you build only for Unicode you can get away with std::wstring::c_str(), but conceptually it would be better to define a tstring (a typedef for std::basic_string<TCHAR>) so you can safely use this kind of string with all the Windows and library functions which expect TCHARs.¹
For additional fun you should also define all the other string-related C++ facilities for TCHARs, and create conversion functions std::string/std::wstring <=> tstring (sketched below).
Fortunately, this work has already been done; see here and here.
¹ Actually, no compiled library function can really expect a TCHAR*, since TCHARs are resolved to chars or wchar_ts at compile time, but you get the idea.
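A minimal sketch of the tstring idea (the typedef names are illustrative; the std::string/std::wstring conversions themselves would use MultiByteToWideChar/WideCharToMultiByte as shown in the other answers):
#include <sstream>
#include <string>
#include <tchar.h>

typedef std::basic_string<TCHAR>       tstring;
typedef std::basic_stringstream<TCHAR> tstringstream;
typedef std::basic_ostream<TCHAR>      tostream;
// ... and so on for the other stream types, following the same pattern.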
Use the ATL and MFC String Conversion Macros. This works regardless of whether you are compiling in _UNICODE or ANSI mode.
You can use these macros even if you aren’t using MFC. Just include the two ATL headers shown in this example:
#include <string>
#include <Windows.h>
#include <AtlBase.h>
#include <AtlConv.h>
int main()
{
    std::wstring myString = L"Hello, World!";
    // Here is an ATL string conversion macro:
    CW2T pszT(myString.c_str());
    // pszT is now an object which can be used anywhere a `const TCHAR*`
    // is required. For example:
    ::MessageBox(NULL, pszT, _T("Test MessageBox"), MB_OK);
    return 0;
}

_T( ) macro changes for UNICODE character data

I have a UNICODE application wherein we use _T(x), which is defined as follows:
#if defined(_UNICODE)
#define _T(x) L ##x
#else
#define _T(x) x
#endif
I understand that the L prefix produces a wchar_t literal, and that wchar_t will be 4 bytes on any platform. Please correct me if I am wrong. My requirement is that I need these literals to be 2 bytes wide, so as a compiler hack I started using the -fshort-wchar gcc flag. But now I need my application to be moved to zSeries, where I don't get to see the effect of the -fshort-wchar flag.
In order to port my application to zSeries, I need to modify the _T( ) macro in such a way that even after using L ##x, and without using -fshort-wchar, I get 2-byte wide character data. Can someone tell me how I can change the definition of L so that it always produces 2-byte characters in my application?
You can't - not without C++0x support. C++0x defines the following ways of declaring string literals:
"string of char characters in some implementation defined encoding" - char
u8"String of utf8 chars" - char
u"string of utf16 chars" - char16_t
U"string of utf32 chars" - char32_t
L"string of wchar_t in some implementation defined encoding" - wchar_t
Until C++0x is widely supported, the only way to encode a UTF-16 string in a cross-platform way is to break it up into individual code units:
// make a char16_t type to stand in until msvc/gcc/etc supports
// c++0x utf string literals
#ifndef CHAR16_T_DEFINED
#define CHAR16_T_DEFINED
typedef unsigned short char16_t;
#endif
const char16_t strABC[] = { 'a', 'b', 'c', '\0' };
// the same declaration would work for a type that changes from 8 to 16 bits:
#ifdef _UNICODE
typedef char16_t TCHAR;
#else
typedef char TCHAR;
#endif
const TCHAR strABC2[] = { 'a', 'b', 'c', '\0' };
The _T macro can only deliver the goods on platforms where wchar_t is 16 bits wide. And the alternative is still not truly cross-platform: the encoding of char and wchar_t is implementation-defined, so 'a' does not necessarily encode the Unicode code point for 'a' (0x61). Thus, to be strictly accurate, this is the only way of writing the string:
const TCHAR strABC[] = { '\x61', '\x62', '\x63', '\0' };
Which is just horrible.
Ah! The wonders of portability :-)
If you have a C99 compiler for all your platforms, use int_least16_t, uint_least16_t, ... from <stdint.h>. Most platforms also define int16_t but it's not required to exist (if the platform is capable of using exactly 16 bits at a time, the typedef int16_t must be defined).
Now wrap all the strings in arrays of uint_least16_t and make sure your code does not expect values of uint_least16_t to wrap at 65535 ...
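A small sketch of what that looks like in practice (the typedef name is illustrative):
#include <stdint.h>

/* Each element holds one UTF-16 code unit. uint_least16_t is guaranteed to
 * exist and to be at least 16 bits wide, but it may be wider, so don't rely
 * on arithmetic wrapping at 65535. */
typedef uint_least16_t u16unit;

static const u16unit strABC[] = { 0x0061, 0x0062, 0x0063, 0x0000 };  /* "abc" */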