Reading an input of mixed unicode characters and integers [duplicate] - c++

I'm asking for a code snippet that reads a Unicode string with cin, concatenates another Unicode string to it, and then writes the result with cout.
P.S. This code will help me solve another, bigger problem with Unicode. But first the key thing is to accomplish what I ask here.
ADDED: BTW, I can't type any Unicode symbol in the command line when I run the executable. How should I do that?

I had a similar problem in the past; in my case, imbue and sync_with_stdio did the trick. Try this:
#include <iostream>
#include <locale>
#include <string>
using namespace std;
int main() {
    // Decouple the wide streams from C stdio so the imbued locale takes effect.
    ios_base::sync_with_stdio(false);
    wcin.imbue(locale("en_US.UTF-8"));
    wcout.imbue(locale("en_US.UTF-8"));
    wstring s;
    wstring t(L" la Polynésie française");
    wcin >> s;
    wcout << s << t << endl;
    return 0;
}

It depends on what kind of Unicode you mean. I assume you are simply working with std::wstring, though. In that case, use std::wcin and std::wcout.
For conversion between encodings you can use your OS's functions, such as WideCharToMultiByte and MultiByteToWideChar on Win32, or a library like libiconv.
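For instance, a minimal sketch of a wide-to-UTF-8 conversion via WideCharToMultiByte (the helper name ToUtf8 is illustrative, not any library API):
#include <string>
#include <windows.h>

// Hypothetical helper: convert a UTF-16 wstring to UTF-8 bytes.
// A sketch only; errors are reduced to returning an empty string.
std::string ToUtf8(const std::wstring& w)
{
    if (w.empty()) return std::string();
    // First call: ask for the required buffer size in bytes.
    const int size = WideCharToMultiByte(CP_UTF8, 0, w.c_str(), (int)w.size(),
                                         NULL, 0, NULL, NULL);
    if (size <= 0) return std::string();
    std::string result(size, '\0');
    // Second call: perform the conversion into the buffer.
    WideCharToMultiByte(CP_UTF8, 0, w.c_str(), (int)w.size(),
                        &result[0], size, NULL, NULL);
    return result;
}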

Here is an example that shows four different methods, of which only the third (C conio) and the fourth (native Windows API) work (but only if stdin/stdout aren't redirected). Note that you still need a font that contains the character you want to show (Lucida Console supports at least Greek and Cyrillic). Note that everything here is completely non-portable, there is just no portable way to input/output Unicode strings on the terminal.
#ifndef UNICODE
#define UNICODE
#endif
#ifndef _UNICODE
#define _UNICODE
#endif
#define STRICT
#define NOMINMAX
#define WIN32_LEAN_AND_MEAN

#include <iostream>
#include <string>
#include <cstdlib>
#include <cstdio>
#include <conio.h>
#include <windows.h>

void testIostream();
void testStdio();
void testConio();
void testWindows();

int wmain() {
    testIostream();
    testStdio();
    testConio();
    testWindows();
    std::system("pause");
}

// Method 1: C++ wide iostreams (does not work for non-ASCII input).
void testIostream() {
    std::wstring first, second;
    std::getline(std::wcin, first);
    if (!std::wcin.good()) return;
    std::getline(std::wcin, second);
    if (!std::wcin.good()) return;
    std::wcout << first << second << std::endl;
}

// Method 2: C stdio wide functions (does not work either).
void testStdio() {
    wchar_t buffer[0x1000];
    if (!_getws_s(buffer)) return;
    const std::wstring first = buffer;
    if (!_getws_s(buffer)) return;
    const std::wstring second = buffer;
    const std::wstring result = first + second;
    _putws(result.c_str());
}

// Method 3: conio (works, console only).
void testConio() {
    wchar_t buffer[0x1000];
    std::size_t numRead = 0;
    if (_cgetws_s(buffer, &numRead)) return;
    const std::wstring first(buffer, numRead);
    if (_cgetws_s(buffer, &numRead)) return;
    const std::wstring second(buffer, numRead);
    const std::wstring result = first + second + L'\n';
    _cputws(result.c_str());
}

// Method 4: native console API (works, console only).
void testWindows() {
    const HANDLE stdIn = GetStdHandle(STD_INPUT_HANDLE);
    WCHAR buffer[0x1000];
    DWORD numRead = 0;
    // ReadConsoleW takes a count of characters, not bytes,
    // so divide sizeof by the element size.
    if (!ReadConsoleW(stdIn, buffer, sizeof buffer / sizeof *buffer, &numRead, NULL)) return;
    const std::wstring first(buffer, numRead - 2); // strip the trailing CR LF
    if (!ReadConsoleW(stdIn, buffer, sizeof buffer / sizeof *buffer, &numRead, NULL)) return;
    const std::wstring second(buffer, numRead);    // keep the newline this time
    const std::wstring result = first + second;
    const HANDLE stdOut = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD numWritten = 0;
    WriteConsoleW(stdOut, result.c_str(), result.size(), &numWritten, NULL);
}
Edit 1: I've added a method based on conio.
Edit 2: I've messed around with _O_U16TEXT a bit as described in Michael Kaplan's blog, but that seemingly only made _getws interpret the (8-bit) data from ReadFile as UTF-16. I'll investigate further over the weekend.
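For reference, modern CRTs do support _O_U16TEXT for output; a minimal sketch of that approach (output side only, not a claim about the input issue discussed above):
#include <fcntl.h> // _O_U16TEXT
#include <io.h>    // _setmode, _fileno
#include <stdio.h>

int main()
{
    // Switch stdout to UTF-16 mode; afterwards only wide output
    // functions (wprintf, wcout, ...) may be used on this stream.
    _setmode(_fileno(stdout), _O_U16TEXT);
    wprintf(L"\u03B1\u03B2\u03B3\n"); // Greek alpha, beta, gamma
    return 0;
}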

If you have actual text (i.e., a string of logical characters), then insert it into the wide streams. The wide streams will automatically encode your characters to match the encoding expected by the locale. (And if you have encoded bytes instead, the streams will decode the bytes, then re-encode them to match the locale.)
There is a lesser solution: if you KNOW you have UTF-encoded bytes (i.e., a byte sequence intended to be decoded into a string of logical characters) AND you KNOW the target of the output stream expects that very same byte format, then you can skip the decoding and re-encoding steps and write() the bytes as-is. This only works when you know both sides use the same encoding format, which may be the case for small utilities not intended to communicate with processes in other locales.
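A minimal sketch of that pass-through approach, assuming both the string's encoding and the terminal's expected encoding are UTF-8:
#include <iostream>
#include <string>

int main()
{
    // "é" and "ç" spelled out as raw UTF-8 bytes; the bytes are
    // written as-is, with no decoding or re-encoding in between.
    const std::string utf8 = "la Polyn\xC3\xA9sie fran\xC3\xA7aise\n";
    std::cout.write(utf8.data(), utf8.size());
    return 0;
}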

It depends on the OS. If your OS understands UTF-8, you can simply send it UTF-8 sequences.

Related

How to convert unicode characters to uppercase in C++

I'm learning about Unicode in C++ and I'm having a hard time getting it to work properly. I'm trying to treat the individual characters as uint64_t. It works if all I need is to print out the characters, but the problem is that I need to convert them to uppercase. I could store the uppercase letters in an array and simply use the same index as I do for the lowercase letters, but I'm looking for a more elegant solution. I found this similar question, but most of the answers used wide characters, which is not something I can use. Here is what I have attempted:
#include <iostream>
#include <locale>
#include <string>
#include <cstdint>
#include <algorithm>

// hacky solution to store a multibyte character in a uint64_t
#define E(c) ((((uint64_t) 0 | (uint32_t) c[0]) << 32) | (uint32_t) c[1])

typedef std::string::value_type char_t;

char_t upcase(char_t ch) {
    return std::use_facet<std::ctype<char_t>>(std::locale()).toupper(ch);
}

std::string toupper(const std::string &src) {
    std::string result;
    std::transform(src.begin(), src.end(), std::back_inserter(result), upcase);
    return result;
}

const uint64_t VOWS_EXTRA[]
{
    E("å"), E("ä"), E("ö"), E("ij"), E("ø"), E("æ")
};

int main(void) {
    char name[5];
    std::locale::global(std::locale("sv_SE.UTF8"));
    name[0] = (VOWS_EXTRA[3] >> 32) & ~((uint32_t)0);
    name[1] = VOWS_EXTRA[3] & ~((uint32_t)0);
    name[2] = '\0';
    std::cout << toupper(name) << std::endl;
}
I expect this to print out the character IJ, but in reality it prints the same character as in the beginning (ij).
(EDIT: OK, so I read more about the Unicode support in standard C++ here. It seems like my best bet is to use something like ICU or Boost.Locale for this task. C++ essentially treats std::string as a blob of binary data, so it doesn't seem to be an easy task to uppercase Unicode letters properly. I think my hacky solution using uint64_t isn't in any way more useful than the C++ standard library, if not worse. I'd be grateful for an example of how to achieve the behaviour stated above using ICU.)
Have a look at the ICU User Guide. For simple (single-character) case mapping, you can use u_toupper. For full case mapping, use u_strToUpper. Example code:
#include <unicode/uchar.h>
#include <unicode/ustdio.h>
#include <unicode/ustring.h>

int main() {
    // Simple case mapping: one code point in, one code point out.
    UChar32 upper = u_toupper(U'ij');
    u_printf("%lC\n", upper);

    // Full case mapping: ß uppercases to the two-character "SS".
    UChar src = u'ß';
    UChar dest[3];
    UErrorCode err = U_ZERO_ERROR;
    u_strToUpper(dest, 3, &src, 1, NULL, &err);
    u_printf("%S\n", dest);
    return 0;
}
Also, if anyone else is looking for it: std::towupper and std::towlower seemed to work fine.
https://en.cppreference.com/w/cpp/string/wide/towupper
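For example, a small sketch using std::towupper; note that it maps one wide character at a time, so it cannot handle mappings like ß to "SS" that ICU's full case mapping covers:
#include <clocale>  // std::setlocale
#include <cwctype>  // std::towupper
#include <iostream>
#include <string>

int main()
{
    std::setlocale(LC_ALL, ""); // towupper is locale-sensitive
    std::wstring word = L"ärger";
    for (wchar_t& ch : word)
        ch = (wchar_t)std::towupper(ch); // one wide character at a time
    std::wcout << word << L'\n';         // expected: ÄRGER
    return 0;
}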

c++ can't convert string to wstring

I would like to convert a string variable to wstring because of some German characters that cause problems when doing a substr over the variable. The start position is falsified when any of these special characters is present before it. (For instance: for "ä", size() returns 2 instead of 1.)
I know that the following conversion works:
wstring ws = L"ä";
Since I am trying to convert a variable, I would like to know if there is an alternative way to do it, such as
wstring wstr = L"%s"+str // this is syntactically wrong, but I want something alike
Besides that, I have already tried the following example to convert a string to a wstring:
string foo("ä");
wstring_convert<codecvt_utf8<wchar_t>> converter;
wstring wfoo = converter.from_bytes(foo.data());
cout << foo.size() << endl;
cout << wfoo.size() << endl;
but I get errors like
‘wstring_convert’ was not declared in this scope
I am using Ubuntu 14.04 and my main.cpp is compiled with CMake. Thanks for your help!
The solution from "hahakubile" worked for me:
std::wstring s2ws(const std::string& s) {
    // Temporarily switch to the user's default locale for mbstowcs.
    std::string curLocale = setlocale(LC_ALL, "");
    const char* _Source = s.c_str();
    size_t _Dsize = mbstowcs(NULL, _Source, 0) + 1; // required length incl. terminator
    wchar_t *_Dest = new wchar_t[_Dsize];
    wmemset(_Dest, 0, _Dsize);
    mbstowcs(_Dest, _Source, _Dsize);
    std::wstring result = _Dest;
    delete [] _Dest;
    setlocale(LC_ALL, curLocale.c_str());
    return result;
}
But the return value is not 100% correct:
string s = "101446012MaßnStörfall PAt #Maßnahme Störfall 00810000100121000102000020100000000000000";
wstring ws2 = s2ws(s);
cout << ws2.size() << endl; // returns 110 which is correct
wcout << ws2.substr(29,40) << endl; // returns #Ma�nahme St�rfall with symbols
I am wondering why it replaced the German characters with symbols.
Thanks again!
If you are using Windows/Visual Studio and need to convert a string to wstring you should use:
#include <AtlBase.h>
#include <atlconv.h>
...
string s = "some string";
CA2W ca2w(s.c_str());
wstring w = ca2w;
printf("%s = %ls", s.c_str(), w.c_str());
Same procedure for converting a wstring to string (sometimes you will need to specify a codepage):
#include <AtlBase.h>
#include <atlconv.h>
...
wstring w = L"some wstring";
CW2A cw2a(w.c_str());
string s = cw2a;
printf("%s = %ls", s.c_str(), w.c_str());
You could specify a codepage and even UTF8 (that's pretty nice when working with JNI/Java).
CA2W ca2w(str, CP_UTF8);
If you want to know more about codepages there is an interesting article on Joel on Software: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.
These CA2W (Convert ANSI to Wide = Unicode) macros are part of the ATL and MFC String Conversion Macros; samples are included there.
Sometimes you will need to disable security warning 4995; I don't know of another workaround (for me it happened when I compiled for Windows XP in VS2012).
#pragma warning(push)
#pragma warning(disable: 4995)
#include <AtlBase.h>
#include <atlconv.h>
#pragma warning(pop)
Edit:
Well, according to this article, the article by Joel appears to be "while entertaining, it is pretty light on actual technical details". A more thorough article: What Every Programmer Absolutely, Positively Needs To Know About Encoding And Character Sets To Work With Text.
The main point is that
string foo("ä")
is already an error. Start from here and read all the answers. And beware, one is very wrong :)
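To see concretely what that means, a tiny sketch (assuming the source file is saved as UTF-8):
#include <iostream>
#include <string>

int main()
{
    std::string foo("ä");            // stored as two UTF-8 bytes, 0xC3 0xA4
    std::cout << foo.size() << '\n'; // prints 2: size() counts char units,
                                     // not logical characters
    return 0;
}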

Printing unicode character in C

I have a local-language font installed on my system (Windows 8). Through the Character Map tool in Windows, I found the Unicode code points of those characters for that particular font.
All I want is to print those characters on the command line through a C program.
For example: assume the Greek letter alpha is represented with the Unicode code "u+0074".
Taking "u+0074" as input, I would like my C program to output the alpha character.
Can anyone help me?
There are several issues. If you're running in a console window, I'd convert the code to UTF-8 and set the code page for the window to 65001. Alternatively, you can use wchar_t (which is UTF-16 on Windows), output via std::wostream, and set the code page to 1200. (According to the documentation I've found, at least. I have no experience with this, because my code has had to be portable, and on the other platforms I've worked on, wchar_t has been either some private 32-bit encoding or UTF-32.)
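A minimal sketch of the UTF-8/65001 variant described above (Windows only, and assuming the console font has the glyph):
#include <cstdio>
#include <windows.h>

int main()
{
    SetConsoleOutputCP(CP_UTF8);      // code page 65001, like "chcp 65001"
    std::fputs("\xCE\xB1\n", stdout); // U+03B1 alpha as raw UTF-8 bytes
    return 0;
}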
First you should set a TrueType font (e.g. Consolas) in the console's Properties. Then this code should suffice in your case:
#include <stdio.h>
#include <tchar.h>
#include <iostream>
#include <string>
#include <Windows.h>
#include <fstream>
// for _setmode()
#include <io.h>
#include <fcntl.h>
using namespace std;

int _tmain(int argc, _TCHAR* argv[])
{
    // Two elements so the string is null-terminated for _tcslen.
    TCHAR tch[2];
    tch[0] = 0x03B1; // U+03B1 GREEK SMALL LETTER ALPHA
    tch[1] = 0;

    // Test1 - WriteConsole
    HANDLE hStdOut = GetStdHandle(STD_OUTPUT_HANDLE);
    if (hStdOut == INVALID_HANDLE_VALUE) return 1;
    DWORD dwBytesWritten;
    WriteConsole(hStdOut, tch, (DWORD)_tcslen(tch), &dwBytesWritten, NULL);
    WriteConsole(hStdOut, L"\n", 1, &dwBytesWritten, NULL);

    // Switch stdout to UTF-16 mode for the stdio/iostream tests below.
    _setmode(_fileno(stdout), _O_U16TEXT);

    // Test2 - wprintf
    _tprintf_s(_T("%s\n"), tch);

    // Test3 - wcout
    wcout << tch << endl;
    wprintf(L"\x03B1\n");
    if (wcout.bad())
    {
        _tprintf_s(_T("\nError in wcout\n"));
        return 1;
    }
    return 0;
}
From MSDN:
_setmode is typically used to modify the default translation mode of stdin and stdout, but you can use it on any file. If you apply _setmode to the file descriptor for a stream, call _setmode before performing any input or output operations on the stream.
Use the Unicode version of the WriteConsole function. Also, be sure to store the source code as UTF-8 with a BOM, which is supported by both g++ and Visual C++.
Example, assuming that you want to present a Greek alpha given its Unicode code point in the form "u+03B1" (the code you listed actually stands for a lowercase "t"):
#include <stdlib.h>       // exit, EXIT_FAILURE, wcstol
#include <string>         // std::wstring
using namespace std;

#undef UNICODE
#define UNICODE
#include <windows.h>

bool error( char const s[] )
{
    ::FatalAppExitA( 0, s );
    exit( EXIT_FAILURE );
}

namespace stream_handle {
    HANDLE const output = ::GetStdHandle( STD_OUTPUT_HANDLE );
}  // namespace stream_handle

void write( wchar_t const* const s, int const n )
{
    DWORD n_chars_written;
    ::WriteConsole(
        stream_handle::output,
        s,
        n,
        &n_chars_written,
        nullptr             // overlapped i/o structure
        )
        || error( "WriteConsole failed" );
}

int main()
{
    wchar_t const input[] = L"u+03B1";
    wchar_t const ch = wcstol( input + 2, nullptr, 16 );
    wstring const s = wstring() + ch + L"\r\n";
    write( s.c_str(), s.length() );
}
In C there is the primitive type wchar_t, which defines a wide character. There are also corresponding wide versions of the string functions, e.g. strcat -> wcscat. Of course it depends on the environment you are using. If you use Visual Studio, have a look here.
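For instance, a minimal sketch of wide-string concatenation with wcscat (the buffer is simply assumed large enough here):
#include <cwchar> // std::wcscat, std::wprintf

int main()
{
    wchar_t buffer[32] = L"alpha: "; // assumed big enough for both parts
    std::wcscat(buffer, L"\u03B1");  // append U+03B1 GREEK SMALL LETTER ALPHA
    std::wprintf(L"%ls\n", buffer);
    return 0;
}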

Compare std::wstring and std::string

How can I compare a wstring, such as L"Hello", to a string? If I need to have the same type, how can I convert them into the same type?
Since you asked, here's my standard conversion functions from string to wide string, implemented using C++ std::string and std::wstring classes.
First off, make sure to start your program by calling setlocale:
#include <clocale>
int main()
{
    std::setlocale(LC_CTYPE, ""); // before any string operations
}
Now for the functions. First off, getting a wide string from a narrow string:
#include <iostream> // std::cout (missing from the original listing)
#include <string>
#include <vector>
#include <cassert>
#include <cstdlib>
#include <cwchar>
#include <cerrno>

// Dummy overload
std::wstring get_wstring(const std::wstring & s)
{
    return s;
}

// Real worker
std::wstring get_wstring(const std::string & s)
{
    const char * cs = s.c_str();
    const size_t wn = std::mbsrtowcs(NULL, &cs, 0, NULL);
    if (wn == size_t(-1))
    {
        std::cout << "Error in mbsrtowcs(): " << errno << std::endl;
        return L"";
    }

    std::vector<wchar_t> buf(wn + 1);
    const size_t wn_again = std::mbsrtowcs(buf.data(), &cs, wn + 1, NULL);
    if (wn_again == size_t(-1))
    {
        std::cout << "Error in mbsrtowcs(): " << errno << std::endl;
        return L"";
    }

    assert(cs == NULL); // successful conversion
    return std::wstring(buf.data(), wn);
}
And going back, making a narrow string from a wide string. I call the narrow string "locale string", because it is in a platform-dependent encoding depending on the current locale:
// Dummy
std::string get_locale_string(const std::string & s)
{
    return s;
}

// Real worker
std::string get_locale_string(const std::wstring & s)
{
    const wchar_t * cs = s.c_str();
    const size_t wn = std::wcsrtombs(NULL, &cs, 0, NULL);
    if (wn == size_t(-1))
    {
        std::cout << "Error in wcsrtombs(): " << errno << std::endl;
        return "";
    }

    std::vector<char> buf(wn + 1);
    const size_t wn_again = std::wcsrtombs(buf.data(), &cs, wn + 1, NULL);
    if (wn_again == size_t(-1))
    {
        std::cout << "Error in wcsrtombs(): " << errno << std::endl;
        return "";
    }

    assert(cs == NULL); // successful conversion
    return std::string(buf.data(), wn);
}
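With those two helpers in place, comparing a wstring with a string becomes straightforward; a usage sketch, assuming the definitions above are in scope:
#include <clocale>

int main()
{
    std::setlocale(LC_CTYPE, ""); // as noted above, set the locale first

    std::wstring ws = L"Hello";
    std::string ns = "Hello";

    // Convert the narrow string up to wide, then compare wide to wide.
    if (ws == get_wstring(ns))
        std::cout << "equal\n";
    else
        std::cout << "different\n";
}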
Some notes:
If you don't have std::vector::data(), you can say &buf[0] instead.
I've found that the r-style conversion functions mbsrtowcs and wcsrtombs don't work properly on Windows. There, you can use mbstowcs and wcstombs instead: mbstowcs(buf.data(), cs, wn + 1); and wcstombs(buf.data(), cs, wn + 1);
In response to your question, if you want to compare two strings, you can convert both of them to wide strings and then compare those. If you are reading a file from disk in a known encoding, you should use iconv() to convert the file from that known encoding to wchar_t and then compare with the wide string.
Beware, though, that complex Unicode text may have multiple different representations as code point sequences which you may want to consider equal. If that is a possibility, you need to use a higher-level Unicode processing library (such as ICU) and normalize your strings to some common, comparable form.
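If normalization matters, a hedged sketch using ICU's C++ API (icu::Normalizer2), with NFC chosen here purely as an example normalization form:
#include <unicode/normalizer2.h>
#include <unicode/unistr.h>

// True when the two strings are canonically equivalent, i.e. equal
// after NFC normalization. A sketch only: production code should
// check 'status' after every ICU call.
bool canonically_equal(const icu::UnicodeString& a,
                       const icu::UnicodeString& b)
{
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
    if (U_FAILURE(status)) return false;
    return nfc->normalize(a, status) == nfc->normalize(b, status);
}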
You should convert the char string to a wchar_t string using mbstowcs, and then compare the resulting strings. Notice that mbstowcs works on char */wchar_t *, so you'll probably need to do something like this:
std::wstring StringToWstring(const std::string & source)
{
    std::wstring target(source.size()+1, L' ');
    std::size_t newLength = std::mbstowcs(&target[0], source.c_str(), target.size());
    target.resize(newLength);
    return target;
}
I'm not entirely sure that that usage of &target[0] is entirely standard-conforming; since C++11 the storage of std::basic_string is guaranteed to be contiguous, so writing through &target[0] is fine there. Also, there's an implicit assumption that the converted string won't be longer (in number of wchar_ts) than the number of chars of the original string - a logical assumption that I'm still not sure is covered by the standard.
On the other hand, you can ask mbstowcs for the size of the needed buffer by passing a null destination pointer (as the s2ws function earlier on this page does), or you can go with (better done and better defined) code from Unicode libraries (be it Windows APIs or libraries like iconv).
Still, keep in mind that comparing Unicode strings without using special functions is slippery ground; two equivalent strings may compare as different when compared bitwise.
Long story short: this should work, and I think it's the most you can do with just the standard library, but how Unicode is handled is heavily implementation-dependent, and I wouldn't trust it a lot. In general, it's better to stick with one encoding inside your application and avoid this kind of conversion unless absolutely necessary; and if you are working with definite encodings, use APIs that are less implementation-dependent.
Think twice before doing this — you might not want to compare them in the first place. If you are sure you do and you are using Windows, then convert string to wstring with MultiByteToWideChar, then compare with CompareStringEx.
If you are not using Windows, then the analogous functions are mbstowcs and wcscmp. The standard wide character C++ functions are often not portable under Windows; for instance mbstowcs is deprecated.
The cross-platform way to work with Unicode is to use the ICU library.
Take care to use special functions for Unicode string comparison, don't do it manually. Two Unicode strings could have different characters, yet still be the same.
wstring ConvertToUnicode(const string & str)
{
    UINT codePage = CP_ACP;
    DWORD flags = 0;
    // First call: ask for the required size in wide characters.
    int resultSize = MultiByteToWideChar
        ( codePage       // CodePage
        , flags          // dwFlags
        , str.c_str()    // lpMultiByteStr
        , str.length()   // cbMultiByte
        , NULL           // lpWideCharStr
        , 0              // cchWideChar
        );
    vector<wchar_t> result(resultSize + 1);
    // Second call: perform the conversion into the buffer.
    MultiByteToWideChar
        ( codePage       // CodePage
        , flags          // dwFlags
        , str.c_str()    // lpMultiByteStr
        , str.length()   // cbMultiByte
        , &result[0]     // lpWideCharStr
        , resultSize     // cchWideChar
        );
    return &result[0];
}
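A usage sketch combining that helper with the CompareStringEx suggestion from the earlier answer (Windows only; LINGUISTIC_IGNORECASE is just one possible flag, and ConvertToUnicode from above is assumed to be in scope):
#include <cstdio>
#include <string>
#include <windows.h>
using namespace std;

int main()
{
    wstring ws = L"HELLO";
    string s = "hello";

    int rc = CompareStringEx(
        LOCALE_NAME_USER_DEFAULT,        // compare under the user's locale
        LINGUISTIC_IGNORECASE,           // one possible flag; 0 for exact match
        ws.c_str(), -1,                  // -1: string is null-terminated
        ConvertToUnicode(s).c_str(), -1,
        NULL, NULL, 0);

    if (rc == CSTR_EQUAL)
        wprintf(L"equal\n");
    else if (rc == 0)
        wprintf(L"CompareStringEx failed: %lu\n", GetLastError());
    else
        wprintf(L"different\n");
}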
