conversion of utf-8 string to hex number strange problem - c++

I need a reliable way to convert utf-8 string to hex. (I work with latex encoding)
My initial unicode string as I see from vs debug is
std::string symb = вЋ§;
And I know that in latex compiler (and adobe illustrator) this thing corresponds to the hex representation "f8f1".
It seems to me I have a problem with signed and unsigned int somewhere.
What I do is as follows:
#include <codecvt> // for std::codecvt_utf8
#include <locale> // for std::wstring_convert
std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv_utf8_utf32;
std::u32string unicode_codepoints = conv_utf8_utf32.from_bytes(symb); // converts utf-8 to unicode
std::vector<std::string> array_of_symbols;
for (int i = 0; i < unicode_codepoints.length(); i++)
{
int symb1 = conv_utf8_utf32.from_bytes(symb)[i];
std::stringstream ss;
ss << std::hex << symb1;
std::string res(ss.str());
std::string res1 = "0x" + res;
array_of_symbols.push_back(res1);
}
But instead of obtaining f8f1 I got a different number.
In fact, already the variable unicode_codepoints has the "wrong" chart_32_t variable equal to 9127 (yet correct unicode representation, see the red symbol in fig.1).
The "correct" variable yielding f8f1 should, as it seems, be negative.
What surprises here, is that this code works for almost all other situations. Could someone explain what is wrong here?
And finally, why I'm so sure that the correct representation is f8f1, because the final render of this symbol in svg looks like this (see fig.2) and the encoding in the right upper corner.

Related

UTF-8, sprintf, strlen, etc

I try to understand how to handle basic UTF-8 operations in C++.
Let's say we have this scenario: User inputs a name, it's limited to 10 letters (symbols in user's language, not bytes), it's being stored.
It can be done this way in ASCII.
// ASCII
char * input; // user's input
char buf[11] // 10 letters + zero
snprintf(buf,11,"%s",input); buf[10]=0;
int len= strlen(buf); // return 10 (correct)
Now, how to do it in UTF-8? Let's assume it's up to 4 bytes charset (like Chinese).
// UTF-8
char * input; // user's input
char buf[41] // 10 letters * 4 bytes + zero
snprintf(buf,41,"%s",input); //?? makes no sense, it limits by number of bytes not letters
int len= strlen(buf); // return number of bytes not letters (incorrect)
Can it be done with standard sprintf/strlen? Are there any replacements of those function to use with UTF-8 (in PHP there was mb_ prefix of such functions IIRC)? If not, do I need to write those myself? Or maybe do I need to approach it another way?
Note: I would prefer to avoid wide characters solution...
EDIT: Let's limit it to Basic Multilingual Plane only.
I would prefer to avoid wide characters solution...
Wide characters are just not enough, because if you need 4 bytes for a single glyph, then that glyph is likely to be outside the Basic Multilingual Plane, and it will not be represented by a single 16 bits wchar_t character (assuming wchar_t is 16 bits wide which is just the common size).
You will have to use a true unicode library to convert the input to a list of unicode characters in their Normal Form C (canonical composition) or the compatibility equivalent (NFKC)(*) depending on whether for example you want to count one or two characters for the ligature ff (U+FB00). AFAIK, you best bet should be ICU.
(*) Unicode allows multiple representation for the same glyph, notably the normal composed form (NFC) and normal decomposed form (NFD). For example the french é character can be represented in NFC as U+00E9 or LATIN SMALL LETTER E WITH ACUTE or as U+0065 U+0301 or LATIN SMALL LETTER E followed with COMBINING ACUTE ACCENT (also displayed as é).
References and other examples on Unicode equivalence
strlen only counts the bytes in the input string, until the terminating NUL.
On the other hand, you seem interested in the glyph count (what you called "symbols in user's language").
The process is complicated by UTF-8 being a variable length encoding (as is, in a kind of lesser extent, also UTF-16), so code points can be encoded using one up to four bytes. And there are also Unicode combining characters to consider.
To my knowledge, there's nothing like that in the standard C++ library. However, you may have better luck using third party libraries like ICU.
std::strlen indeed considers only one byte characters. To compute the length of a unicode NUL terminated string, one can use std::wcslen instead.
Example:
#include <iostream>
#include <cwchar>
#include <clocale>
int main()
{
const wchar_t* str = L"爆ぜろリアル!弾けろシナプス!パニッシュメントディス、ワールド!";
std::setlocale(LC_ALL, "en_US.utf8");
std::wcout.imbue(std::locale("en_US.utf8"));
std::wcout << "The length of \"" << str << "\" is " << std::wcslen(str) << '\n';
}
If you do not want to count utf-8 chars by yourself - you can use temporary conversion to widechar to cut your input string. You do not need to store the intermediate values
#include <iostream>
#include <codecvt>
#include <string>
#include <locale>
std::string cutString(const std::string& in, size_t len)
{
std::wstring_convert<std::codecvt_utf8<wchar_t>> cvt;
auto wstring = cvt.from_bytes(in);
if(len < wstring.length())
{
wstring = wstring.substr(0,len);
return cvt.to_bytes(wstring);
}
return in;
}
int main(){
std::string test = "你好世界這是演示樣本";
std::string res = cutString(test,5);
std::cout << test << '\n' << res << '\n';
return 0;
}
/****************
Output
$ ./test
你好世界這是演示樣本
你好世界這
*/

Coding a path in unicode c++

I had a problem with opening UTF-8 path files. Path that has a UTF-8 char (like Cyrillic or Latin). I found a way to solve that with _wfopen but the way a solved it was when I encode the UTF-8 char with UTF by hand (\Uxxxx).
Is there a function, macro or anything that when I supply the string (path) it will return the Unicode??
Something like this:
https://www.branah.com/unicode-converter
I tried with MultiByteToWideChar but it returns some Hex numbers that are not relavent.
Tried:
std::wstring s2ws(const std::string& s)
{
int len;
int slength = (int)s.length() + 1;
len = MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, 0, 0);
wchar_t* buf = new wchar_t[len];
MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, buf, len);
std::wstring r(buf);
delete[] buf;
return r;
}
std::wstring stemp = s2ws(x);
LPCWSTR result = stemp.c_str();
The result I get: 0055F7E8
Thank you in advance
Update:
I installed boost, and now I am trying to do it with boost. Can some one maybe help me out with boost.
So I have a path:
wchar_t path[100] = _T("čaćšžđ\\test.txt");
I need it converted to:
wchar_t s[100] = _T("\u010d\u0061\u0107\u0161\u017e\u0111\\test.txt");
Here's a way to convert between UTF-8 and UTF-16 on Windows, as well as showing the real values of the stored code units for both input and output:
#include <codecvt>
#include <iostream>
#include <iomanip>
#include <string>
int main() {
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
std::string s = "test";
std::cout << std::hex << std::setfill('0');
std::cout << "Input `char` data: ";
for (char c : s) {
std::cout << std::setw(2) << static_cast<unsigned>(static_cast<unsigned char>(c)) << ' ';
}
std::cout << '\n';
std::wstring ws = convert.from_bytes(s);
std::cout << "Output `wchar_t` data: ";
for (wchar_t wc : ws) {
std::cout << std::setw(4) << static_cast<unsigned>(wc) << ' ';
}
std::cout << '\n';
}
Understanding the real values of the input and output is important because otherwise you may not correctly understand the transformation that you really need. For example it looks to me like there may be some confusion as to how VC++ deals with encodings, and what \Uxxxxxxxx and \uxxxx actually do in C++ source code (e.g., they don't necessarily produce UTF-8 data).
Try using code like that shown above to see what your input data really is.
To emphasize what I've written above; there are strong indications that you may not correctly understand the processing that's being done on your input, and you need to thoroughly check it.
The above program does correctly transform the UTF-8 representation of ć (U+0107) into the single 16-bit code unit 0x0107, if you replace the test string with the following:
std::string s = "\xC4\x87"; // UTF-8 representation of U+0107
The output of the program, on Windows using Visual Studio, is then:
Input char data: c4 87
Output wchar_t data: 0107
This is in contrast to if you use test strings such as:
std::string s = "ć";
Or
std::string s = "\u0107";
Which may result in the following output:
Input char data: 3f
Output wchar_t data: 003f
The problem here is that Visual Studio does not use UTF-8 as the encoding for strings without some trickery, so your request to convert from UTF-8 probably isn't what you actually need; or you do need conversion from UTF-8, but you're testing potential conversion routines using input that differs from your real input.
So I have a path: wchar_t path[100] = _T("čaćšžđ\test.txt");
I need it converted to:
wchar_t s[100] = _T("\u010d\u0061\u0107\u0161\u017e\u0111\test.txt");
Okay, so if I understand correctly, your actual problem is that the following fails:
wchar_t path[100] = _T("čaćšžđ\\test.txt");
FILE *f = _wfopen(path, L"w");
But if you instead write the string like:
wchar_t path[100] = _T("\u010d\u0061\u0107\u0161\u017e\u0111\\test.txt");
Then the _wfopen call succeeds and opens the file you want.
First of all, this has absolutely nothing to do with UTF-8. I assume you found some workaround using a char string and converting that to wchar_t and you somehow interpreted this as involving UTF-8, or something.
What encoding are you saving the source code with? Is the string L"čaćšžđ\\test.txt" actually being saved properly? Try closing the source file and reopening it. If some characters show up replaced by ?, then part of your problem is the source file encoding. In particular this is true of the default encoding used by Windows in most of North America and Western Europe: "Western European (Windows) - Codepage 1252".
You can also check the output of the following program:
#include <iomanip>
#include <iostream>
int main() {
wchar_t path[16] = L"čaćšžđ\\test.txt";
std::cout << std::hex << std::setfill('0');
for (wchar_t wc : path) {
std::cout << std::setw(4) << static_cast<unsigned>(wc) << ' ';
}
std::cout << '\n';
wchar_t s[16] = L"\u010d\u0061\u0107\u0161\u017e\u0111\\test.txt";
for (wchar_t wc : s) {
std::cout << std::setw(4) << static_cast<unsigned>(wc) << ' ';
}
std::cout << '\n';
}
Another thing you need to understand is that the \uxxxx form of writing characters, called Universal Character Names or UCNs, is not a form that you can convert strings to and from in C++. By the time you've compiled the program and it's running, i.e. by the time any code you write could be attempting to produce strings containing \uxxxx, the time when UCNs are interpreted by the compiler as different characters is long past. The only UCNs that will work are ones that are written directly in the source file.
Also, you're using _T() incorrectly. IMO You shouldn't be using TCHAR and the related macros at all, but if you do use it then you ought to use it consistently: don't mix TCHAR APIs with explicit use of the *W APIs or wchar_t. The whole point of TCHAR is to allow code to be independent and switch between those wchar_t and Microsoft's "ANSI" APIs, so using TCHAR and then hard coding an assumption that TCHAR is wchar_t defeats the entire purpose.
You should just write:
wchar_t path[100] = L"čaćšžđ\\test.txt";
Your code is Windows-specific, and you're using Visual C++. So, just use wide literals. Visual C++ supports wide strings for file stream constructors.
It's as simple as that &dash; when you don't require portability.
#include <fstream>
#include <iostream>
#include <stdlib.h>
using namespace std;
auto main() -> int
{
wchar_t const path[] = L"cacšžd/test.txt";
ifstream f( path );
int ch;
while( (ch = f.get()) != EOF )
{
cout.put( ch );
}
}
Note, however, that this code is Visual C++ specific. That's reasonable for Windows-specific code. Possibly with C++17 we will have Boost file system library adopted into the standard library, and then for conformance g++ will ideally offer the constructor used here.
The problem was that I was saving the CPP file as ANSI... I had to convert it to UTF-8. I tried this before posting but VS 2015 turns it into ANSI, I had to change it in VS so I could get it working.
I tried opening the cpp file with notepad++ and changing the encoding but when I turn on VS it automatically returns. So I was looking to Save As option but there is no encoding option. Finally i found it, in Visual Studio 2015
File -> Advanced Save Options in the Encoding dropdown change it to Unicode
One thing that is still strange to me, how did VS display the characters normally but when I opened the file in N++ there was ? (like it was supposed to be, because of ANSI)?

Convert double to wstring, but wstring must be formatted with scientific notation

I have a double in format xxxxx.yyyy, for example 0.001500.
I would like to convert it into wstring, formatted with scientific notation. This is the result I want: 0.15e-2.
I am not that experienced with C++, so I checked std::wstring reference and haven't found any member function that can do this.
I have found similar threads here on Stack Overflow but I just do not know how to apply those answers to solve my problem especially as they do not use wchar.
I have tried to solve this myself:
// get the value from database as double
double d = // this would give 0.5
// I do not know how determine proper size to hold the converted number
// so I hardcoded 4 here, so I can provide an example that demonstrates the issue
int len = 3 + 1; // + 1 for terminating null character
wchar_t *txt = new wchar_t[3 + 1];
memset(txt, L'\0', sizeof(txt));
swprintf_s(txt, 4, L"%e", d);
delete[] txt;
I just do not know how to allocate big enough buffer to hold the result of the conversion. Every time I get buffer overflow, and all the answers here from similar thread estimate the size. I really would like to avoid this type of introducing "magic" numbers.
I do not know how to use stringstream either, because those answers did not convert double into wstring with scientific notation.
All I want is to convert double into wstring, with resulting wstring being formatted with scientific notation.
You can use std::wstringstream and the std::scientific flag to get the output you're looking for as a wstring.
#include <iostream>
#include <iomanip>
#include <string>
#include <sstream>
int main(int argc, char * argv[])
{
double val = 0.001500;
std::wstringstream str;
str << std::scientific << val;
std::wcout << str.str() << std::endl;
return 0;
}
You can also set the floating-point precision with additional output flags. Take a look at the reference page for more output manipulators you can use. Unfortunately I don't believe your sample expected output is possible since the correct scientific notation would be 1.5e-3.

how to print each character of strings that mix ascii character with unicode?

for example, I want to create some typewriter effects so need to print strings like that:
#include <string>
int main(){
std::string st1="ab》cd《ef";
for(int i=0;i<st1.size();i++){
std::string st2=st1.substr(0,i).c_str();
printf("%s\n",st2.c_str());
}
return 0;
}
but the output is
a
ab
ab?
ab?
ab》
ab》c
ab》cd
ab》cd?
ab》cd?
ab》cd《
ab》cd《e
and not:
a
ab
ab》
ab》c
ab》cd
ab》cd《
ab》cd《e
how to know the upcoming character is unicode?
similar question, print each character also has the problem:
#include <string>
int main(){
std::string st1="ab》cd《ef";
for(int i=0;i<st1.size();i++){
std::string st2=st1.substr(i,1).c_str();
printf("%s\n",st2.c_str());
}
return 0;
}
the output is:
a
b
?
?
?
c
d
?
?
?
e
f
not:
a
b
》
c
d
《
e
f
I think the problem is encoding. Likely your string is in UTF-8 encoding which has variable sized characters. This means you can not iterate one char at a time because some characters are more than one char wide.
The fact is, in unicode, you can only iterate reliably one fixed character at a time with UTF-32 encoding.
So what you can do is use a UTF library like ICU to convert vetween UTF-8 and UTF-32.
If you have C++11 then there are some tools to help you here, mostly std::u32string which is able to hold UTF-32 encoded strings:
#include <string>
#include <iostream>
#include <unicode/ucnv.h>
#include <unicode/uchar.h>
#include <unicode/utypes.h>
// convert from UTF-32 to UTF-8
std::string to_utf8(std::u32string s)
{
UErrorCode status = U_ZERO_ERROR;
char target[1024];
int32_t len = ucnv_convert(
"UTF-8", "UTF-32"
, target, sizeof(target)
, (const char*)s.data(), s.size() * sizeof(char32_t)
, &status);
return std::string(target, len);
}
// convert from UTF-8 to UTF-32
std::u32string to_utf32(const std::string& utf8)
{
UErrorCode status = U_ZERO_ERROR;
char32_t target[256];
int32_t len = ucnv_convert(
"UTF-32", "UTF-8"
, (char*)target, sizeof(target)
, utf8.data(), utf8.size()
, &status);
return std::u32string(target, (len / sizeof(char32_t)));
}
int main()
{
// UTF-8 input (needs UTF-8 editor)
std::string utf8 = "ab》cd《ef"; // UTF-8
// convert to UTF-32
std::u32string utf32 = to_utf32(utf8);
// Now it is safe to use string indexing
// But i is for length so starting from 1
for(std::size_t i = 1; i < utf32.size(); ++i)
{
// convert back to to UTF-8 for output
// NOTE: i + 1 to include the BOM
std::cout << to_utf8(utf32.substr(0, i + 1)) << '\n';
}
}
Output:
a
ab
ab》
ab》c
ab》cd
ab》cd《
ab》cd《e
ab》cd《ef
NOTE:
The ICU library adds a BOM (Byte Order Mark) at the beginning of the strings it converts into Unicode. Therefore you need to deal with the fact that the first character of the UTF-32 string is the BOM. This is why the substring uses i + 1 for its length parameter to include the BOM.
Your C++ code is simply echoing octets to your terminal, and it is your terminal display that's converting octets encoded in its default character set to unicode characteers.
It looks like, based on your example, that your terminal display uses UTF-8. The rules for converting UTF-8-encoded characters to unicode are fairly well specified (Google is your friend), so all you have to do is to check the first character of a UTF-8 sequence to figure out how many octets make up the next unicode character.

c++ can't convert string to wstring

I would like to convert a string variable to wstring due to some german characters that cause problem when doing a substr over the variable. The start position is falsified when any these special characters is present before it. (For instance: for "ä" size() returns 2 instead of 1)
I know that the following conversion works:
wstring ws = L"ä";
Since, I am trying to convert a variable, I would like to know if there is an alternative way for it such as
wstring wstr = L"%s"+str //this is syntaxically wrong, but wanted sth alike
Beside that, I have already tried the following example to convert string to wstring:
string foo("ä");
wstring_convert<codecvt_utf8<wchar_t>> converter;
wstring wfoo = converter.from_bytes(foo.data());
cout << foo.size() << endl;
cout << wfoo.size() << endl;
, but I get errors like
‘wstring_convert’ was not declared in this scope
I am using ubuntu 14.04 and my main.cpp is compiled with cmake. Thanks for your help!
The solution from "hahakubile" worked for me:
std::wstring s2ws(const std::string& s) {
std::string curLocale = setlocale(LC_ALL, "");
const char* _Source = s.c_str();
size_t _Dsize = mbstowcs(NULL, _Source, 0) + 1;
wchar_t *_Dest = new wchar_t[_Dsize];
wmemset(_Dest, 0, _Dsize);
mbstowcs(_Dest,_Source,_Dsize);
std::wstring result = _Dest;
delete []_Dest;
setlocale(LC_ALL, curLocale.c_str());
return result;
}
But the return value is not 100% correct:
string s = "101446012MaßnStörfall PAt #Maßnahme Störfall 00810000100121000102000020100000000000000";
wstring ws2 = s2ws(s);
cout << ws2.size() << endl; // returns 110 which is correct
wcout << ws2.substr(29,40) << endl; // returns #Ma�nahme St�rfall with symbols
I am wondering why it replaced german characters with symbols.
Thanks again!
If you are using Windows/Visual Studio and need to convert a string to wstring you should use:
#include <AtlBase.h>
#include <atlconv.h>
...
string s = "some string";
CA2W ca2w(s.c_str());
wstring w = ca2w;
printf("%s = %ls", s.c_str(), w.c_str());
Same procedure for converting a wstring to string (sometimes you will need to specify a codepage):
#include <AtlBase.h>
#include <atlconv.h>
...
wstring w = L"some wstring";
CW2A cw2a(w.c_str());
string s = cw2a;
printf("%s = %ls", s.c_str(), w.c_str());
You could specify a codepage and even UTF8 (that's pretty nice when working with JNI/Java).
CA2W ca2w(str, CP_UTF8);
If you want to know more about codepages there is an interesting article on Joel on Software: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.
These CA2W (Convert Ansi to Wide=unicode) macros are part of ATL and MFC String Conversion Macros, samples included.
Sometimes you will need to disable the security warning #4995', I don't know of other workaround (to me it happen when I compiled for WindowsXp in VS2012).
#pragma warning(push)
#pragma warning(disable: 4995)
#include <AtlBase.h>
#include <atlconv.h>
#pragma warning(pop)
Edit:
Well, according to this article the article by Joel appears to be: "while entertaining, it is pretty light on actual technical details". Article: What Every Programmer Absolutely, Positively Needs To Know About Encoding And Character Sets To Work With Text.
The main point is that
string foo("ä")
Is already an error. Start from here and read all answers. And beware, one is very wrong :)