Encoding Vietnamese characters from ISO88591, UTF8, UTF16BE, UTF16LE, UTF16 to Hex and vice versa using C++

I have edited my post. What I'm currently trying to do is encode an input string from the user and then convert it to hex. I can do it properly as long as the string contains no Vietnamese characters: if my inputString is "Hello" it works, but when I input a string such as "Tôi", I don't know how to handle it.
enum Encodings { USASCII, ISO88591, UTF8, UTF16BE, UTF16LE, UTF16, BIN, OCT, HEX };
switch (encoding) // encoding holds one of the Encodings values
{
case USASCII:
    ASCIIToHex(inputString, &ascii); // "hello" -> 48656C6C6F
    return new ByteField(ascii.c_str());
case ISO88591:
    ASCIIToHex(inputString, &ascii); // "hello" -> 48656C6C6F
                                     // "tôi"   -> 54F469
    return new ByteField(ascii.c_str());
case UTF8:
    ASCIIToHex(inputString, &ascii); // "hello" -> 48656C6C6F
                                     // "tôi"   -> 54C3B469
    return new ByteField(ascii.c_str());
case UTF16BE:
    ToUTF16(inputString, &ascii, encoding); // "hello" -> 00480065006C006C006F
                                            // "tôi"   -> 005400F40069
    return new ByteField(ascii.c_str());
case UTF16:
    ToUTF16(inputString, &ascii, encoding); // "hello" -> FEFF00480065006C006C006F
                                            // "tôi"   -> FEFF005400F40069
    return new ByteField(ascii.c_str());
case UTF16LE:
    ToUTF16(inputString, &ascii, encoding); // "hello" -> 480065006C006C006F00
                                            // "tôi"   -> 5400F4006900
    return new ByteField(ascii.c_str());
}
void StringUtilLib::ASCIIToHex(std::string s, std::string * result)
{
    int n = s.length();
    for (int i = 0; i < n; i++)
    {
        unsigned char c = s[i];
        long val = long(c);
        // build the binary string for this byte, then convert it to hex
        std::string bin = "";
        while (val > 0)
        {
            (val % 2) ? bin.push_back('1') : bin.push_back('0');
            val /= 2;
        }
        reverse(bin.begin(), bin.end());
        result->append(ConvertBinToHex(bin));
    }
}
void ToUTF16(std::string s, std::string * result, int encodings) {
    int n = s.length();
    if (encodings == UTF16) {
        result->append("FEFF"); // byte order mark
    }
    for (int i = 0; i < n; i++)
    {
        int val = int(s[i]);
        std::string bin = "";
        while (val > 0)
        {
            (val % 2) ? bin.push_back('1') : bin.push_back('0');
            val /= 2;
        }
        reverse(bin.begin(), bin.end());
        if (encodings == UTF16 || encodings == UTF16BE) {
            result->append("00" + ConvertBinToHex(bin));
        }
        if (encodings == UTF16LE) {
            result->append(ConvertBinToHex(bin) + "00");
        }
    }
}
std::string ConvertBinToHex(std::string str) {
    // read the binary digits as if they were a decimal number,
    // then peel them off digit by digit to rebuild the value
    long long temp = atoll(str.c_str());
    int dec_value = 0;
    int base = 1;
    int i = 0;
    while (temp) {
        int last_digit = temp % 10;
        temp = temp / 10;
        dec_value += last_digit * base;
        base = base * 2;
    }
    // convert the value to hex digits, least significant first
    char hexaDeciNum[10];
    while (dec_value != 0)
    {
        int rem = dec_value % 16;
        if (rem < 10)
            hexaDeciNum[i++] = rem + '0';
        else
            hexaDeciNum[i++] = rem + 'A' - 10;
        dec_value = dec_value / 16;
    }
    str.clear();
    for (int j = i - 1; j >= 0; j--) {
        str = str + hexaDeciNum[j];
    }
    return str;
}

The question is completely unclear. To encode something you need an input, right? So when you say "encoding Vietnamese characters to UTF-8, UTF-16", what is your input string, and what encoding is it in before converting to UTF-8/16? How do you input it, from a file or the console?
And why on earth are you converting to binary and then to hex? You can print the bytes directly in binary or hex; there's no need to go through a binary string. Note that converting to binary like that is fine for testing but vastly inefficient in production code. I also don't know what you mean by "But what if my letter is "Á" or "À" which is a Vietnamese letter I cannot get the value of it". Please show a minimal, reproducible example along with the input/output.
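For what it's worth, going from bytes straight to hex takes only a few lines; here is a minimal sketch (the ToHex name is mine):
#include <cstdio>
#include <string>

// Print each byte of a string directly as two hex digits; no binary detour needed
std::string ToHex(const std::string& s) {
    std::string out;
    char buf[3];
    for (unsigned char c : s) {
        std::snprintf(buf, sizeof buf, "%02X", c); // unsigned char avoids sign-extension
        out += buf;
    }
    return out;
}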
But I think you just want to output the UTF encoded bytes of a string literal in the source code like "ÁÀ". In that case it isn't called "encoding a string" but just "outputting a string".
Both Á and À in Unicode can be represented by precomposed characters (U+00C1 and U+00C0) or by combining characters (A + U+0301 ◌́ / U+0300 ◌̀). You can switch between them by selecting "Unicode dựng sẵn" (precomposed) or "Unicode tổ hợp" (combining) in Unikey. If you have those characters in a string literal, then std::string str = "ÁÀ" contains the series of bytes that encodes those letters in the source file's encoding. So depending on which encoding you save the *.cpp file as (CP1252, CP1258, UTF-8...), the output byte values will differ.
To force UTF-8/16/32 encoding you just need to use the u8, u and U prefixes respectively, along with the correct types (char8_t, char16_t, char32_t, or std::u8string/std::u16string/std::u32string):
std::u8string utf8 = u8"ÁÀ";
std::u16string utf16 = u"ÁÀ";
std::u32string utf32 = U"ÁÀ";
Then just use c_str() to get the underlying buffers and print the bytes. std::u8string and char8_t only arrive in C++20, so in C++14 just save the file as UTF-8 and use std::string. Similarly, you can read user input into such a string and print its bytes the same way.
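For example, a sketch of dumping the encoded bytes/code units, assuming a C++20 compiler (for char8_t) and a UTF-8 source file:
#include <cstdio>
#include <string>

int main() {
    std::u8string utf8 = u8"ÁÀ";
    for (char8_t b : utf8)
        std::printf("%02X", (unsigned)b);   // UTF-8 bytes: C381C380
    std::printf("\n");
    std::u16string utf16 = u"ÁÀ";
    for (char16_t u : utf16)
        std::printf("%04X", (unsigned)u);   // UTF-16 code units: 00C100C0
    std::printf("\n");
}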
Edit:
To convert between UTF encodings use the standard std::codecvt, std::wstring_convert, std::codecvt_utf8_utf16... (deprecated since C++17, but still available).
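A minimal round-trip sketch with those (the UTF-8 bytes are spelled out so the source file encoding doesn't matter):
#include <codecvt>
#include <locale>
#include <string>

int main() {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    std::u16string utf16 = conv.from_bytes("T\xC3\xB4i"); // the UTF-8 bytes of "Tôi"
    std::string utf8 = conv.to_bytes(utf16);              // and back to UTF-8
}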
Working on non-Unicode encodings is trickier and needs some external library like ICU or OS-dependent APIs
WideCharToMultiByte and MultiByteToWideChar on Windows
iconv on Linux
Limiting yourself to ISO-8859-1 makes things easier, since its 256 code points map one-to-one to U+0000 through U+00FF; other single-byte encodings need lookup tables, and there's no way to convert other encodings to ASCII without loss of information.
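Because each ISO-8859-1 byte value equals its Unicode code point, that conversion is a one-liner; a sketch:
#include <string>

// ISO-8859-1 -> UTF-16: every Latin-1 byte value is its own code point
std::u16string Latin1ToUtf16(const std::string& s) {
    std::u16string out;
    for (unsigned char c : s)
        out.push_back(static_cast<char16_t>(c));
    return out;
}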

-64 is the correct representation of À if you are using signed char and CP1258. If you want a positive number you need to cast to unsigned char first.
If you are indeed using CP1258, you are probably on Windows. To convert your input string to UTF-16, you probably want to use a Windows platform API such as MultiByteToWideChar which accepts a code page parameter (of course you have to use the correct code page). Alternatively you may try a standard function like mbstowcs but you need to set up your locale correctly before using it.
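A sketch of the MultiByteToWideChar route, assuming the input really is CP1258 and omitting error handling (the function name is mine):
#include <windows.h>
#include <string>

// CP1258 (Vietnamese) bytes -> UTF-16, via the usual size-then-convert pattern
std::wstring FromCp1258(const std::string& s) {
    int len = MultiByteToWideChar(1258, 0, s.c_str(), (int)s.size(), nullptr, 0);
    std::wstring out(len, L'\0');
    MultiByteToWideChar(1258, 0, s.c_str(), (int)s.size(), &out[0], len);
    return out;
}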
You might find it easier to switch to wide characters throughout your application, and avoid most transcoding.
As a side note, converting an integer to binary only to convert that to hexadecimal is neither an easy nor an efficient way to display the hexadecimal representation of an integer.

Related

Retrieve CString file path from XML file

I have an XML file with many values and a working C++ function that can retrieve these values
Two of these values are:
A file path such as: "C:\foo1\foo2" and
A file name: "foo3.txt"
Combining these together, they would become "C:\foo1\foo2\foo3.txt"
However, while trying to set a CString to hold a file path, it gives an error, because a bare \ character in a string literal is not allowed: string notation treats \ as the start of an escape sequence.
I am using MFC, and I know WIN32 allows you to create a file path with / instead of \, so: "C:/foo1/foo2/foo3.txt" would work. I tested this in Windows Explorer and it worked.
I would like to collect the file path from the XML file, but it will come in with \ instead of / in the path, so it seems it will not be possible to simply replace the character (the incoming string will already be broken, since XML has no problem with the \ character and passes it through as-is).
How do I safely retrieve the path as a CString, ideally while converting any \ character to a / character?
Now I'm not familiar with the "CString" class you are referring to. Googling the API documentation just shows the standard C-style char array commands, so I'm going to assume, rightly or wrongly, that CString is a char array.
The fact that we need to use an object that is not resizable means we either:
Need to use the heap, which will be slow, and can leak memory if it isn't deleted later
Allow a maximum string length and accept that the result will be truncated if it exceeds this
Heap example (NOTE: I'm not using smart pointers, as I assume they don't have access to them; otherwise you'd just use std::string and not do this):
char* escapeString(const char* data, unsigned int length){
    //multiplying by 1.5 means this could still truncate,
    //but I'm making an educated guess it's not all bad characters.
    const unsigned int newLen = (unsigned int)((length + 1) * 1.5);
    char* escaped = new char[newLen + 1];
    unsigned int index = 0;
    //bound on index (not i), since each iteration can write two characters
    for(unsigned int i = 0; i < length && index + 1 < newLen; i++){
        if(data[i] == '\\' || data[i] == '\"'){
            escaped[index++] = '\\';
        }
        else if(data[i] == '%'){
            escaped[index++] = '%';
        }
        //else anything else you want to escape
        escaped[index++] = data[i];
    }
    //Make sure the result is null terminated
    escaped[index] = '\0';
    return escaped;
}
#include <cstring>
#include <iostream>

int main() {
    const char* stringWithBadChars = "I\"m not a %%good \\string";
    char* escapedString = escapeString(stringWithBadChars, strlen(stringWithBadChars));
    std::cout << escapedString;
    delete [] escapedString;
    return 0;
}
If we do this on the stack instead it will be a lot faster, but we are limited by the size of the buffer we pass in and the size of the buffer inside the function. We return false if either limit is hit.
bool escapeString(char* data, unsigned int length){
    const unsigned int newLen = 1000;
    char escaped[newLen + 1];
    unsigned int index = 0;
    unsigned int i = 0;
    for(; i < length && index + 1 < newLen; i++){
        if(data[i] == '\\' || data[i] == '\"'){
            escaped[index++] = '\\';
        }
        else if(data[i] == '%'){
            escaped[index++] = '%';
        }
        //else anything else you want to escape
        escaped[index++] = data[i];
    }
    //Make sure the result is null terminated, then copy it back
    //(the caller's buffer must be large enough for the escaped string)
    escaped[index] = '\0';
    memcpy(data, escaped, index + 1);
    return i == length; //false means we ran out of room and truncated
}
You could probably get even more efficiency using memmove rather than copying character by character. Doing it that way you also wouldn't need the second char array.
CString reserves some special characters. Have a look at the Format command as an example. The linked documentation refers you to: Format specification syntax: printf and wprintf functions.
The \ is used as mentioned in the comments to indicate a special character. For example:
\t will insert a tab character.
\" will insert a double quote character.
So when it hits the \ it expects the next character to be one of the special ones. Therefore, when you actually need a backslash, you use \\.
The linked article does explain % but not the backslash. However, it is exactly the same with %, because it too has special meaning, so you use %% when you want a percent sign.
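Putting it together for the original question, something along these lines should work (a sketch using MFC's CString::Replace):
// The \\ is only source-code notation; a path read from XML at runtime
// needs no escaping at all, its bytes already contain single backslashes.
CString path = _T("C:\\foo1\\foo2");  // holds C:\foo1\foo2
path += _T("\\foo3.txt");             // holds C:\foo1\foo2\foo3.txt
path.Replace(_T('\\'), _T('/'));      // now  C:/foo1/foo2/foo3.txt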

How to convert an integer to a unicode character?

So I wanted to try converting Unicode to an integer for a project of mine. I tried something like this :
unsigned int foo = (unsigned int)L'آ';
std::cout << foo << std::endl;
How do I convert it back? Or in other words, how do I convert an int to the corresponding Unicode character?
EDIT: I am expecting the output to be the Unicode character for a given integer, for example:
cout << (wchar_t) 1570; // This should print the character with code point 1570 (which is: آ)
I am using Visual Studio 2013 Community with its default compiler, on Windows 10 64-bit Pro
Cheers
L'آ' will work okay as a single wide character, because it is below 0xFFFF. But in general UTF-16 includes surrogate pairs, so a Unicode code point cannot always be represented with a single wide character; you need a wide string instead.
Your problem is also partly to do with printing UTF-16 characters in the Windows console. If you use MessageBoxW to view a wide string, it will work as expected:
wchar_t buf[2] = { 0 };
buf[0] = 1570;
MessageBoxW(0, buf, 0, 0);
However, in general you need a wide string to account for surrogate pairs, not a single wide char. Example:
int utf32 = 1570;
const int mask = (1 << 10) - 1;
std::wstring str;
if (utf32 < 0x10000) // BMP code points fit in a single UTF-16 unit
{
    str.push_back((wchar_t)utf32);
}
else // encode as a surrogate pair
{
    utf32 -= 0x10000;
    int hi = (utf32 >> 10) & mask;
    int lo = utf32 & mask;
    hi += 0xD800;
    lo += 0xDC00;
    str.push_back((wchar_t)hi);
    str.push_back((wchar_t)lo);
}
MessageBoxW(0, str.c_str(), 0, 0);
See related posts for printing UTF16 in Windows console.
The key here is setlocale(LC_ALL, "en_US.UTF-8");. en_US is the locale name, which you may want to set to a different value, like zh_CN for Chinese, for example.
#include <clocale>
#include <cwchar>

int main() {
    setlocale(LC_ALL, "en_US.UTF-8");
    // This does not work without setlocale(LC_ALL, "en_US.UTF-8");
    for (int ch = 30000; ch < 30030; ch++) {
        wprintf(L"%lc", ch);
    }
    // stay with wprintf: mixing printf and wprintf on one stream is undefined
    wprintf(L"\n");
    return 0;
}
Things to notice here are the use of wprintf and how the format string is given: L"%lc", which tells wprintf to treat the string and the character as wide characters.
If you want to use this method to print some variables, use the type wchar_t.
Useful links:
setlocale
wprintf

Converting an unsigned char (BYTE) array to const wchar_t* (LPCWSTR)

Alright, so I have a BYTE array that I need to ultimately convert into an LPCWSTR or const WCHAR* to use in a built-in function. I have been able to print out the BYTE array with printf, but now that I need to convert it into a string I am having problems, mainly that I have no idea how to convert something like this into a non-array type.
BYTE ba[0x10];
for(int i = 0; i < 0x10; i++)
{
printf("%02X", ba[i]); // Outputs: F1BD2CC7F2361159578EE22305827ECF
}
So I basically need the same thing, but instead of printing the array I need it transformed into an LPCWSTR or WCHAR*, or even a string. The main problem I am having is converting the array into a non-array form.
LPCWSTR represents a UTF-16 encoded string. The array contents you have shown are outside the 7-bit ASCII range, so unless the BYTE array is already encoded in UTF-16 (yours is not, but if it were, a simple type-cast would do), you will need to convert it to UTF-16. You need to know the array's actual encoding before you can do that conversion, for example with the Win32 API MultiByteToWideChar() function, third-party libraries like iconv or ICU, or the built-in locale converters in C++11. So what is the actual encoding of the array, and where is the data coming from? It is not UTF-8, for instance, so it has to be something else.
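That said, if all you need is the hex text from your printf loop in wide form, a minimal sketch (the function name is mine):
#include <windows.h>
#include <cwchar>
#include <string>

// Format the bytes as hex digits directly into a std::wstring;
// its c_str() can then be passed wherever an LPCWSTR is expected.
std::wstring BytesToHexW(const BYTE* ba, size_t n) {
    std::wstring out;
    wchar_t buf[3];
    for (size_t i = 0; i < n; i++) {
        std::swprintf(buf, 3, L"%02X", ba[i]);
        out += buf;
    }
    return out;
}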
Alright, I got it working. Now I can convert the BYTE array to a char* variable. Thanks for the help guys, but the formatting wasn't a large problem in this instance. I appreciate the help though, it's always nice to have some extra input.
// Helper function to convert one byte to two hex digits
void Char2Hex(unsigned char ch, char* szHex)
{
    unsigned char byte[2];
    byte[0] = ch / 16;
    byte[1] = ch % 16;
    for (int i = 0; i < 2; i++)
    {
        if (byte[i] <= 9)
            szHex[i] = '0' + byte[i];
        else
            szHex[i] = 'A' + byte[i] - 10;
    }
    szHex[2] = 0;
}

// Function used throughout the code to convert a byte buffer to a hex string
void CharStr2HexStr(unsigned char const* pucCharStr, char* pszHexStr, int iSize)
{
    char szHex[3];
    pszHexStr[0] = 0;
    for (int i = 0; i < iSize; i++)
    {
        Char2Hex(pucCharStr[i], szHex);
        strcat(pszHexStr, szHex);
    }
}
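Since hex digits are plain ASCII, the resulting char* can also be widened trivially if an LPCWSTR is still needed; a sketch (the helper name is mine):
#include <cstring>
#include <string>

// Hex digits are ASCII, so each char maps one-to-one to a wide char
std::wstring HexToWide(const char* pszHexStr) {
    return std::wstring(pszHexStr, pszHexStr + std::strlen(pszHexStr));
}
// usage: MessageBoxW(0, HexToWide(pszHexStr).c_str(), 0, 0);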

Convert QString to Hex?

I have a QString where I append data input from the user.
At the end of the QString, I need to append the hexadecimal representation of a "Normal" QString.
For example:
QString Test("ff00112233440a0a");
QString Input("Words");
Test.append(Input);//but here is where Input needs to be the Hex representation of "Words"
//The resulting variable should be
//Test == "ff00112233440a0a576f726473";
How can I convert from ASCII (I think) to its hex representation?
Thanks for your time.
You were very close:
Test.append(QString::fromLatin1(Input.toLatin1().toHex()));
Another solution to your problem.
Given a character, you can use the following simple function to compute its hex representation.
// Call this function twice -- once with the first 4 bits and once for the last
// 4 bits of a char to get the hex representation of a char.
char toHex(char c) {
// Assume that the input is going to be 0-F.
if ( c <= 9 ) {
return c + '0';
} else {
return c + 'A' - 10;
}
}
You can use it as:
char c;
// ... Assign a value to c
// Get the hex characters for c (mask both halves so a signed char can't sign-extend)
char h1 = toHex((c >> 4) & 0xF);
char h2 = toHex(c & 0xF);

Reading Unicode characters from a file in C++

I want to read a Unicode (UTF-8) file character by character, but I don't know how to read a single character at a time from a file.
Can anyone tell me how to do that?
First, look at how UTF-8 encodes characters: http://en.wikipedia.org/wiki/UTF-8#Description
Each Unicode character is encoded as one or more UTF-8 bytes. After you read the next byte from the file, according to that table:
(Row 1) If the most significant bit is 0 ((byte & 0x80) == 0), you have your character.
(Row 2) If the three most significant bits are 110 ((byte & 0xE0) == 0xC0), you have to read another byte. Bits 4, 3, 2 of the first UTF-8 byte (110YYYyy) make the first byte of the Unicode character (00000YYY), and its two least significant bits together with the 6 least significant bits of the next byte (10xxxxxx) make the second byte of the Unicode character (yyxxxxxx). You can do the bit arithmetic easily with C/C++ shifts and logical operators:
UnicodeByte1 = (UTF8Byte1 >> 2) & 0x07;
UnicodeByte2 = ( (UTF8Byte1 << 6) & 0xC0 ) | (UTF8Byte2 & 0x3F);
And so on...
Sounds a bit complicated, but it's not difficult once you know how to shift the bits into their proper places to decode a UTF-8 string.
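Here is a compact sketch of that decoding loop covering the 1- and 2-byte rows (3- and 4-byte sequences follow the same pattern; the input is assumed valid):
#include <cstdio>

void decodeUtf8(const unsigned char* p, const unsigned char* end) {
    while (p < end) {
        unsigned int cp;
        if ((*p & 0x80) == 0) {            // Row 1: 0xxxxxxx, plain ASCII
            cp = *p++;
        } else if ((*p & 0xE0) == 0xC0) {  // Row 2: 110yyyyy 10xxxxxx
            cp = (unsigned int)(*p++ & 0x1F) << 6;
            cp |= *p++ & 0x3F;
        } else {                           // Rows 3-4: same idea, more bytes
            ++p;
            continue;
        }
        std::printf("U+%04X\n", cp);
    }
}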
UTF-8 is ASCII compatible, so you can read a UTF-8 file like you would an ASCII file. The C++ way to read a whole file into a string is:
#include <iostream>
#include <string>
#include <fstream>
std::ifstream fs("my_file.txt");
std::string content((std::istreambuf_iterator<char>(fs)), std::istreambuf_iterator<char>());
The resulting string has one char per UTF-8 byte. You could loop through it like so:
for (std::string::iterator i = content.begin(); i != content.end(); ++i) {
char nextChar = *i;
// do stuff here.
}
Alternatively, you could open the file in binary mode, and then move through each byte that way:
std::ifstream fs("my_file.txt", std::ifstream::binary);
if (fs.is_open()) {
char nextChar;
while (fs.good()) {
fs >> nextChar;
// do stuff here.
}
}
If you want to do more complicated things, I suggest you take a peek at Qt. I've found it rather useful for this sort of stuff. At least, less painful than ICU, for doing largely practical things.
QFile file("my_file.txt");
if (file.open(QIODevice::ReadOnly | QIODevice::Text)) {
    QTextStream in(&file);
    in.setCodec("UTF-8");
    QString contents = in.readAll();
}
In theory stdlib.h has a function mblen which should return the length of a multibyte symbol. But in my case it returned -1 for the first byte of a multibyte symbol and kept returning -1 after that. So I wrote the following:
int mbSymbolLen(const char* i_ch) // counts the leading 1 bits of the first byte
{
    if (i_ch == nullptr) return -1;
    int l = 0;
    unsigned char ch = *i_ch; // unsigned, so high-bit bytes don't sign-extend
    int mask = 0x80;
    while (ch & mask) {
        l++;
        mask = (mask >> 1);
    }
    if (l == 0) return 1;           // plain ASCII byte
    if (l == 1 || l > 4) return -1; // continuation byte or invalid lead byte
    return l;
}
It took less time than researching how I was supposed to use mblen.
Try this: get the file contents as a string and then loop through the text based on its length.
Pseudocode:
String s = file.toString();
int len = s.length();
for (int i = 0; i < len; i++)
{
    String the_character = s[i];
    // TODO : Do your thing :o)
}