I am reading strings of data from an Oracle database into a C++ program, and the strings may or may not contain Unicode characters. Is there any way to check whether a string extracted from the database contains Unicode (UTF-8) characters? If any Unicode characters are present, they should be converted to hexadecimal format and displayed.
There are two aspects to this question.
Distinguish UTF-8-encoded characters from ordinary ASCII characters.
UTF-8 encodes any code point higher than 127 as a series of two or more bytes. Values at 127 and lower remain untouched. The resultant bytes from the encoding are also higher than 127, so it is sufficient to check a byte's high bit to see whether it qualifies.
Display the encoded characters in hexadecimal.
C++ has std::hex to tell streams to format numeric values in hexadecimal. You can use std::showbase to make the output look pretty. A char isn't treated as numeric, though; streams will just print the character. You'll have to force the value to another numeric type, such as int. Beware of sign-extension, though.
Here's some code to demonstrate:
#include <iostream>
void print_characters(char const* s)
{
    std::cout << std::showbase << std::hex;
    for (char const* pc = s; *pc; ++pc) {
        if (*pc & 0x80)
            std::cout << (*pc & 0xff);
        else
            std::cout << *pc;
        std::cout << ' ';
    }
    std::cout << std::endl;
}
You could call it like this:
int main()
{
    char const* test = "ab\xef\xbb\xbfhu";
    print_characters(test);
    return 0;
}
Output on Solaris 10 with Sun C++ 5.8:
$ ./a.out
a b 0xef 0xbb 0xbf h u
The code detects UTF-8-encoded characters, but it makes no effort to decode them; you didn't mention needing to do that.
I used *pc & 0xff to convert the expression to an integral type and to mask out the sign-extended bits. Without that, the output on my computer was 0xffffffbb, for instance.
I would convert the string to UTF-32 (you can use something like UTF CPP for that - it is very easy), and then loop through the resulting string, detect code points (characters) that are above 0x7F and print them as hex.
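Not from the original answer, but here is a minimal sketch of that approach, assuming the UTF CPP (utfcpp) header utf8.h is available and the input is valid UTF-8:
#include <iostream>
#include <iterator>
#include <string>
#include <vector>
#include "utf8.h" // UTF CPP (utfcpp); header name assumed
// Decode UTF-8 to UTF-32 code points, then print anything above 0x7F in hex.
void print_non_ascii_hex(const std::string& s)
{
    std::vector<char32_t> cps;
    utf8::utf8to32(s.begin(), s.end(), std::back_inserter(cps)); // throws on invalid UTF-8
    std::cout << std::showbase << std::hex;
    for (char32_t cp : cps)
        if (cp > 0x7F)
            std::cout << static_cast<unsigned long>(cp) << ' ';
    std::cout << std::dec << '\n';
}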
Related
I want to convert Unicode characters (Persian) to int.
Based on this list, the Unicode number of 'آ' is U+0622.
Suppose I want to get U+0622 as an integer value. I wrote this piece of code:
unsigned int Alef = (unsigned int)'آ';
std::cout << Alef << std::endl;
output:
63
The correct answer is 1570, and as you see, the output is wrong. I guess it only converts the first byte of the Unicode character.
How do I convert that Unicode character to get the correct answer?
Try expressing the character as a wchar literal:
unsigned int Alef = (unsigned int) L'آ';
std::cout << Alef << std::endl;
But make sure you're saving your source file as Unicode; nano, for example, converts the 'آ' to a '?' before saving. As would Notepad on Windows, I think?
Also, to add to my answer: you should write wide characters to std::wcout, not std::cout, as cout is for single-byte chars and wcout is for wchar_t types.
EDIT: Notepad does save as Unicode
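To complement the wide-literal suggestion above: with a C++11 compiler, a char32_t literal is another option that stores the code point value directly and does not depend on the platform's size of wchar_t (a sketch, not part of the original answer):
#include <iostream>
int main()
{
    char32_t alef = U'\u0622'; // U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE
    std::cout << static_cast<unsigned int>(alef) << std::endl; // prints 1570
    return 0;
}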
I am trying to understand how to handle basic UTF-8 operations in C++.
Let's say we have this scenario: a user inputs a name, which is limited to 10 letters (symbols in the user's language, not bytes), and it is stored.
It can be done this way in ASCII.
// ASCII
char * input; // user's input
char buf[11];                   // 10 letters + zero
snprintf(buf, 11, "%s", input); buf[10] = 0;
int len = strlen(buf);          // returns 10 (correct)
Now, how to do it in UTF-8? Let's assume characters can take up to 4 bytes (e.g., Chinese).
// UTF-8
char * input; // user's input
char buf[41];                   // 10 letters * 4 bytes + zero
snprintf(buf, 41, "%s", input); // ?? makes no sense, it limits by the number of bytes, not letters
int len = strlen(buf);          // returns the number of bytes, not letters (incorrect)
Can it be done with standard snprintf/strlen? Are there any replacements for those functions to use with UTF-8 (in PHP there was an mb_ prefix for such functions, IIRC)? If not, do I need to write them myself? Or should I approach it another way?
Note: I would prefer to avoid wide characters solution...
EDIT: Let's limit it to Basic Multilingual Plane only.
I would prefer to avoid wide characters solution...
Wide characters are just not enough, because if you need 4 bytes for a single glyph, then that glyph is likely to be outside the Basic Multilingual Plane, and it will not be represented by a single 16-bit wchar_t character (assuming wchar_t is 16 bits wide, which is the common size on Windows).
You will have to use a true Unicode library to convert the input to a list of Unicode characters in their Normalization Form C (canonical composition) or the compatibility equivalent (NFKC)(*), depending on whether, for example, you want to count one or two characters for the ligature ff (U+FB00). AFAIK, your best bet is ICU.
(*) Unicode allows multiple representations of the same glyph, notably the normal composed form (NFC) and the normal decomposed form (NFD). For example, the French é character can be represented in NFC as U+00E9 (LATIN SMALL LETTER E WITH ACUTE) or in NFD as U+0065 U+0301 (LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT), which is also displayed as é.
References and other examples on Unicode equivalence
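Not part of the original answer, but here is a rough sketch of the ICU route, counting user-perceived characters via a grapheme-cluster break iterator rather than normalizing first (assumes ICU is installed and linked, e.g. with -licuuc):
#include <unicode/brkiter.h>
#include <unicode/unistr.h>
#include <unicode/locid.h>
#include <memory>
#include <string>
// Count grapheme clusters (user-perceived characters) in a UTF-8 string with ICU.
int32_t grapheme_count(const std::string& utf8)
{
    icu::UnicodeString us = icu::UnicodeString::fromUTF8(utf8);
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::BreakIterator> it(
        icu::BreakIterator::createCharacterInstance(icu::Locale::getDefault(), status));
    if (U_FAILURE(status))
        return -1;
    it->setText(us);
    int32_t count = 0;
    it->first();
    while (it->next() != icu::BreakIterator::DONE)
        ++count;
    return count;
}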
strlen only counts the bytes in the input string, until the terminating NUL.
On the other hand, you seem interested in the glyph count (what you called "symbols in user's language").
The process is complicated by UTF-8 being a variable-length encoding (as is, to a lesser extent, UTF-16), so a code point can be encoded using one to four bytes. And there are also Unicode combining characters to consider.
To my knowledge, there's nothing like that in the standard C++ library. However, you may have better luck using third party libraries like ICU.
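As a rough illustration of the byte-versus-code-point distinction (a hand-rolled sketch, not a substitute for ICU, and it counts code points rather than glyphs): in UTF-8 every continuation byte has the bit pattern 10xxxxxx, so skipping those bytes gives the code point count.
#include <cstddef>
// Count UTF-8 code points by skipping continuation bytes (10xxxxxx).
// Assumes well-formed UTF-8; combining characters still count separately.
std::size_t utf8_codepoint_count(const char* s)
{
    std::size_t count = 0;
    for (; *s; ++s)
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80)
            ++count;
    return count;
}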
std::strlen indeed considers only single-byte characters. To compute the length of a NUL-terminated wide (wchar_t) string, one can use std::wcslen instead.
Example:
#include <iostream>
#include <cwchar>
#include <clocale>
int main()
{
    const wchar_t* str = L"爆ぜろリアル!弾けろシナプス!パニッシュメントディス、ワールド!";
    std::setlocale(LC_ALL, "en_US.utf8");
    std::wcout.imbue(std::locale("en_US.utf8"));
    std::wcout << "The length of \"" << str << "\" is " << std::wcslen(str) << '\n';
}
If you do not want to count UTF-8 characters yourself, you can use a temporary conversion to wide characters to cut your input string. You do not need to store the intermediate values:
#include <iostream>
#include <codecvt>
#include <string>
#include <locale>
std::string cutString(const std::string& in, size_t len)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> cvt;
    auto wstring = cvt.from_bytes(in);
    if (len < wstring.length())
    {
        wstring = wstring.substr(0, len);
        return cvt.to_bytes(wstring);
    }
    return in;
}
int main() {
    std::string test = "你好世界這是演示樣本";
    std::string res = cutString(test, 5);
    std::cout << test << '\n' << res << '\n';
    return 0;
}
/****************
Output
$ ./test
你好世界這是演示樣本
你好世界這
*/
I found an explanation of how to decode hex representations into decimal, but only by using Qt:
How to get decimal value of a unicode character in c++
As I am not using Qt, and cout << (int)c does not work (Edit: it actually does work if you use it properly!):
How to do the following:
I got the hex representation of two chars which were transmitted over some socket (just figured out how to get the hex representation, finally!), and combined they yield the following UTF-16 representation:
char c = u"\0b7f"
This shall be converted into its decimal value of 2943!
(see it at utf-table http://www.fileformat.info/info/unicode/char/0b7f/index.htm)
This should be absolutely elementary stuff, but as a Python developer compelled to use C++ for a project, I have been stuck on this issue for hours...
Use a wider character type (char is only 8 bits; you need at least 16), and also the correct syntax for Unicode character literals. This works (live demo):
#include <iostream>
int main()
{
    char16_t c = u'\u0b7f';
    std::cout << (int)c << std::endl; // output is 2943 as expected
    return 0;
}
As per XSD, the supported binary types are hexBinary and base64Binary (base64-encoded binary data): http://www.w3schools.com/schema/schema_dtypes_misc.asp
My intention is to read raw byte contents from memory and serialize them to an XML file. Hence, which data type above would describe the raw byte contents, or do I have to make sure that the raw byte contents are converted to hexadecimal to adhere to one of the two data types described above?
You do have to convert the raw binary to a hexadecimal (or Base64) representation. E.g., if the value of a byte is 255 (in decimal), its hex representation (as a string) would be "ff".
The conventional type to use for storing the raw input is unsigned char, so you can handle the range 0-255 easily byte by byte, but for each byte of that array you need two characters (in a char array or std::string) to store the hex representation, and that is what you use in the XML.
Your framework probably has a method for converting raw bytes to Base64 or hex. If not, here's one method for hex:
#include <iostream>
#include <string>
#include <sstream>
using namespace std;
int main(void) {
    ostringstream os;
    os.flags(ios::hex);
    unsigned char data[] = { 0, 123, 11, 255, 66, 99 };
    for (int i = 0; i < 6; i++) {
        if (data[i] < 16) os << '0';
        os << (int)data[i] << '|';
    }
    string formatted(os.str());
    cout << formatted << endl;
    return 0;
}
Outputs: 00|7b|0b|ff|42|63|
You need to encode the raw data to one of the two data types. This is to keep some random data from messing up the XML format, for example if you had a < embedded in the data somewhere.
You can choose whichever of the two is most convenient for you. The hexadecimal type is easier to write code for but produces a larger file: the ratio of bytes out to bytes in is 2:1, whereas it is 4:3 for Base64 encoding. You shouldn't need to write your own code, though; Base64 conversion functions are readily available. Here's a question that has some code in the answers: How do I base64 encode (decode) in C?
As an example of how the codings differ, here's the phrase "The quick brown fox jumps over the lazy dog." encoded both ways.
Hex:
54686520717569636b2062726f776e20666f78206a756d7073206f76657220746865206c617a7920646f672e
Base64:
VGhlIHF1aWNrIGJyb3duIGZveCBqdW1wcyBvdmVyIHRoZSBsYXp5IGRvZy4=
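Not from the linked question, but a small self-contained sketch of both encodings, which reproduces the two outputs above:
#include <cstddef>
#include <iostream>
#include <string>
// Hex encoding: two characters per input byte (2:1 expansion).
std::string to_hex(const std::string& in)
{
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (unsigned char c : in) {
        out += digits[c >> 4];
        out += digits[c & 0x0F];
    }
    return out;
}
// Base64 encoding: four characters per three input bytes (4:3 expansion).
std::string to_base64(const std::string& in)
{
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    std::size_t i = 0;
    for (; i + 2 < in.size(); i += 3) {
        unsigned n = (static_cast<unsigned char>(in[i]) << 16)
                   | (static_cast<unsigned char>(in[i + 1]) << 8)
                   |  static_cast<unsigned char>(in[i + 2]);
        out += tbl[(n >> 18) & 63];
        out += tbl[(n >> 12) & 63];
        out += tbl[(n >> 6) & 63];
        out += tbl[n & 63];
    }
    if (in.size() - i == 1) {        // one byte left: pad with "=="
        unsigned n = static_cast<unsigned char>(in[i]) << 16;
        out += tbl[(n >> 18) & 63];
        out += tbl[(n >> 12) & 63];
        out += "==";
    } else if (in.size() - i == 2) { // two bytes left: pad with "="
        unsigned n = (static_cast<unsigned char>(in[i]) << 16)
                   | (static_cast<unsigned char>(in[i + 1]) << 8);
        out += tbl[(n >> 18) & 63];
        out += tbl[(n >> 12) & 63];
        out += tbl[(n >> 6) & 63];
        out += '=';
    }
    return out;
}
int main()
{
    const std::string msg = "The quick brown fox jumps over the lazy dog.";
    std::cout << to_hex(msg) << '\n' << to_base64(msg) << '\n';
    return 0;
}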
I have the lovely functions from my previous question, which work fine if I do this:
wstring temp;
wcin >> temp;
string whatever( toUTF8(getSomeWString()) );
// store whatever, copy, but do not use it as UTF8 (see below)
wcout << toUTF16(whatever) << endl;
The original form is reproduced, but the in-between form often contains extra characters. If I enter, for example, àçé as the input and add a cout << whatever statement, I'll get ┬à┬ç┬é as output.
Can I still use this string to compare to others procured from an ASCII source? Or, asked differently: if I output ┬à┬ç┬é through the UTF-8 cout on Linux, would it read àçé? Is the byte content of a string àçé, read by cin on UTF-8 Linux, exactly the same as what the Win32 API gives me?
Thanks!
PS: the reason I'm asking is because I need to use the string a lot to compare to other read values (comparing and concatenating...).
Let me start by saying that it appears there is simply no way to output UTF-8 text to the console on Windows via cout (assuming you compile with Visual Studio).
What you can do for your tests, however, is output your UTF-8 text via the Win32 API function WriteConsoleA:
#include <windows.h>
#include <cstring>
#include <iostream>
using std::cerr;
int main()
{
    if (!SetConsoleOutputCP(CP_UTF8)) { // 65001
        cerr << "Failed to set console output mode!\n";
        return 1;
    }
    HANDLE const consout = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD nNumberOfCharsWritten;
    const char* utf8 = "Umlaut AE = \xC3\x84 / ue = \xC3\xBC \n";
    if (!WriteConsoleA(consout, utf8, std::strlen(utf8), &nNumberOfCharsWritten, NULL)) {
        DWORD const err = GetLastError();
        cerr << "WriteConsole failed with error " << err << "!\n";
        return 1;
    }
    return 0;
}
This should output Umlaut AE = Ä / ue = ü if you set your console (cmd.exe) to use the Lucida Console font.
As for your question (taken from your comment) whether
a Win32 API converted string is the same as a raw UTF-8 (Linux) string
I will say yes: given a Unicode character sequence, its UTF-16 (Windows wchar_t) representation converted to a UTF-8 (char) representation via the WideCharToMultiByte function will always yield the same byte sequence.
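For reference, a minimal sketch of that UTF-16-to-UTF-8 conversion with WideCharToMultiByte (Windows-only; error handling omitted):
#include <windows.h>
#include <string>
// Convert a UTF-16 (wchar_t) string to its UTF-8 byte sequence via the Win32 API.
std::string utf16_to_utf8(const std::wstring& w)
{
    if (w.empty())
        return std::string();
    int len = WideCharToMultiByte(CP_UTF8, 0, w.c_str(), static_cast<int>(w.size()),
                                  NULL, 0, NULL, NULL);
    std::string out(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, w.c_str(), static_cast<int>(w.size()),
                        &out[0], len, NULL, NULL);
    return out;
}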
When you convert the string to UTF-16, each character is 16 bits wide; you can't compare it to ASCII values directly because those aren't 16-bit values. You have to convert them to compare, or write a specialized comparison-to-ASCII function.
I doubt the UTF-8 cout on Linux would produce the same correct output unless the values were regular ASCII, as UTF-8 encoding forms are binary-compatible with ASCII only for code points below 128, and I assume UTF-16 extends beyond UTF-8 in a similar fashion.
The good news is there are many converters out there written to convert these strings to different character sets.