c++: how to create unsigned char from UTF-8 code point

I'm working with a C++ library, and need to create an unsigned char from a UTF-8 code point. For example, if the code point is decimal 610 (a 'latin letter small capital G'), how would I create this in C++?
In JavaScript, I can do the following:
var temp = String.fromCharCode(610);
console.log(temp); // Outputs a small 'G' (correct)
var codePoint = temp.charCodeAt(0);
console.log(codePoint); // Outputs 610 (correct)
In C++ I have tried:
unsigned char temp = (unsigned char)610;
// compiles, but
Debug::WriteLine((int)temp); // outputs 98 (??)
Please provide a code example in C++ which performs the same as the javascript example above.
The environment is managed C++, but I want to avoid using CLR types as I'm interfacing with a 3rd-party library.

An unsigned char is too small to hold a value of 610 (assuming a char is 8 bits wide, it can only hold values from 0 to 255), so the value will wrap around.*
Use char16_t to store a 16-bit character (or char32_t for a 32-bit character, which can hold any Unicode code point).
char32_t temp = (char32_t)610;
Debug::WriteLine(temp); // outputs 610 (!!)
If you want to handle UTF-8 strings, use UTF-8 string literals:
u8"I'm a UTF-8 string."
*It will wrap around even twice in your example:
610 - 256 - 256 = 98

Unicode code points may need 32-bit representations. For most Western languages 16 bits are enough, but to handle all possible Unicode code points you really do need 32 bits.
uint32_t codePoint = someString.CodePointAt(x);
You can read more about it here: http://en.wikipedia.org/wiki/Code_point.
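As a concrete sketch of the 32-bit approach, C++11's char32_t can hold any code point directly (the CodePointAt call above is illustrative pseudocode, not a standard C++ API):
#include <cstdint>
#include <iostream>
int main()
{
    char32_t cp = U'\u0262'; // LATIN LETTER SMALL CAPITAL G
    std::cout << static_cast<std::uint32_t>(cp) << "\n"; // prints 610
}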

If you mean you want to create an unsigned char pointing to the UTF-8 representation of the Unicode code point 610 you could do:
char unsigned temp[] = { 0xc9, 0xa2 };
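Those two bytes follow the fixed two-byte UTF-8 pattern (110xxxxx 10xxxxxx), so you can also derive them from the code point at runtime; a minimal sketch:
#include <cstdio>
int main()
{
    unsigned int cp = 610; // U+0262
    unsigned char temp[3];
    temp[0] = 0xC0 | (cp >> 6);   // top 5 bits -> 0xC9
    temp[1] = 0x80 | (cp & 0x3F); // low 6 bits -> 0xA2
    temp[2] = 0;                  // terminator so it prints as a string
    printf("%s\n", (char *)temp); // shows the small capital G on a UTF-8 terminal
}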

Related

Difference between converting int to char by (char) and by ASCII

I have an example:
int var = 5;
char ch = (char)var;
char ch2 = var+48;
cout << ch << endl;
cout << ch2 << endl;
In some other code of mine, (char) returned the wrong answer but +48 didn't. When I changed ONLY (char) to +48, my code worked correctly.
What is the difference between converting int to char by using (char) and +48 (ASCII) in C++?
char ch = (char)var; has the same effect as char ch = var; and assigns the numeric value 5 to ch. Assuming ASCII (supported by all modern systems), character code 5 represents Enquiry ('ENQ'), an old terminal control code. Perhaps some old timer has a clue what it did!
char ch2 = var + 48; assigns the numeric value 53 to ch2, which happens to represent the ASCII character for the digit '5'. ASCII 48 is the digit zero ('0'), and the digits appear in order in the ASCII table after it, so 48 + 5 lands on 53 (which represents the character '5').
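A small self-contained demo of the difference:
#include <iostream>
int main()
{
    int var = 5;
    char ch = (char)var; // value 5: the unprintable ENQ control code
    char ch2 = var + 48; // value 53: the printable character '5'
    std::cout << (int)ch << " " << (int)ch2 << "\n"; // 5 53
    std::cout << ch2 << "\n";                        // prints: 5
}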
In C++, char is an integer type. Its value is often interpreted as representing an ASCII character, but it should be thought of as holding a number.
Its numeric range is either [-128, 127] or [0, 255]. That's because C++ requires sizeof(char) == 1 and all modern platforms have 8-bit bytes.
NB: C++ doesn't actually mandate ASCII, but again, that will be the case on all modern platforms.
PS: I think it's an unfortunate artifact of C (inherited by C++) that sizeof(char) == 1 and there isn't a separate fundamental type called byte.
A char is simply the smallest integral type in C++. Output facilities like cout and printf map char values to the corresponding character encoding; on Windows computers this is typically ASCII.
Note that ASCII code 5 maps to the Enquiry character, which has no printable representation, while code 53 maps to the printable character '5'.
A generally accepted hack to store a number 0-9 in a char is const char ch = var + '0'; It's important to note the shortcomings here:
If your code is running on some non-ASCII character mapping, this could in principle break; in practice, though, the C and C++ standards guarantee that the digits '0' through '9' are contiguous and in order
If var is outside the 0-9 range, var + '0' will map to something other than a digit character
A guaranteed way to get the most significant digit of a number, immune to either shortcoming, is:
const char ch = std::to_string(var).front();
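For completeness, a runnable version (std::to_string requires <string> and C++11):
#include <iostream>
#include <string>
int main()
{
    int var = 573;
    const char ch = std::to_string(var).front(); // '5', the most significant digit
    std::cout << ch << "\n";
}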
Generally, char represents a number just as int does. Casting an int value to char doesn't produce its ASCII representation.
The ASCII codes for the digits range from 48 (== '0') to 57 (== '9'). So to get the printable digit you have to add '0' (i.e. 48).
The difference is that (char) only changes the type while keeping the value 5, whereas adding 48 yields 53, the code of the character '5'.
It's important to note that an int is typically 32 bits and a char is typically 8 bits. This means the numbers you can store in a char run from -128 to 127 (or 0 to 255, i.e. 2^8 - 1, if you use unsigned char), while an int runs from -2,147,483,648 (-2^31) to 2,147,483,647 (2^31 - 1) (or 0 to 2^32 - 1 for unsigned).
Adding 48 to a value is not changing the type to char.

Convert unsigned int formatted in HEX to string

I am developing an HMAC-SHA1 class for my exam. I have a problem when I have to apply SHA-1 twice, as described in https://en.wikipedia.org/wiki/Hash-based_message_authentication_code.
When I apply SHA-1 to a string, it returns an unsigned int[5] with the calculated hash. I want to convert that unsigned int[5] into a char[40] holding the hash as hex digits.
For example
unsigned int H[5] = { 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0 };
// char [40] will be "67452301EFCDAB8998BADCFE10325476C3D2E1F0"
So I can concatenate it to ipad, then calculate its hash ipad_hash, and finally calculate opad + ipad_hash by concatenating the two strings.
Is that right?
I'm using an Arduino Uno, so where I write "unsigned int" I actually mean "unsigned long" (int is only 16 bits there).
This is my test code (it's a mess but i will clean it): http://pastebin.com/jfwBxAp1
You can do
char hash_cstr[41];
sprintf(hash_cstr, "%08lX%08lX%08lX%08lX%08lX", H[0], H[1], H[2], H[3], H[4]);
Make sure you allocate at least 41 chars (40 for the hash digits and 1 for the null terminator).
In the format string %08lX, 08 means pad to 8 characters with leading zeros (so you keep the leading 0s of each word), and X means hexadecimal with uppercase letters; use a lowercase x for lowercase output. Hex formatting treats the value as unsigned. The l modifier matches unsigned long (32 bits on the Arduino Uno); for 64-bit types use ll instead.
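Putting it together, a minimal sketch using the question's example words (snprintf is the bounds-checked variant; plain sprintf behaves the same here):
#include <cstdio>
int main()
{
    unsigned long H[5] = { 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0 };
    char hash_cstr[41]; // 40 hex digits + null terminator
    snprintf(hash_cstr, sizeof hash_cstr, "%08lX%08lX%08lX%08lX%08lX",
             H[0], H[1], H[2], H[3], H[4]);
    printf("%s\n", hash_cstr); // 67452301EFCDAB8998BADCFE10325476C3D2E1F0
}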

Converting 2 chars to its ascii binary code

I'm reading binary data in character format from an accelerometer; it consists of a higher byte and a lower byte. It's been a long time since I worked with C++, and I usually only used higher-level stuff.
I have the following function:
short char2short(char* hchar, char* lchar)
{
char temp[2];
temp[0] = *hchar;
temp[1] = *lchar;
// ... ?
}
How can I get those values converted to an integer?
atoi works differently as far as I know (e.g. "21" -> 21).
Can I just typecast the chars to int? But how does that work with the higher and lower byte?
Thanks in advance for any help!
You should store the bytes as unsigned to avoid issues with shifting sign bits.
short char2short(unsigned char hchar, unsigned char lchar)
{
return static_cast<short>(lchar | (hchar << 8));
}
You may also want to use unsigned short. It depends what you expect.
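Usage with a made-up high/low byte pair:
#include <cstdio>
short char2short(unsigned char hchar, unsigned char lchar)
{
    return static_cast<short>(lchar | (hchar << 8));
}
int main()
{
    unsigned char hi = 0x01, lo = 0x90; // hypothetical accelerometer sample
    printf("%d\n", char2short(hi, lo)); // 400 (0x0190)
}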

printf escaped unicode character from integer

I'm doing a rewrite of this question.
I want to create a string with a Unicode escaped character such as "\u03B1" from an integer constant. For example, this string is the Greek letter alpha.
const char *alpha = "\u03B1";
I want to construct the same string with a call to printf, using the integer value 0x03B1. For this example it can be done like this, but I'm not sure how to derive those two numbers from 0x03B1.
printf("%c%c", 206, 177);
This link explains what to do but I'm not sure how to do it.
http://www.fileformat.info/info/unicode/utf8.htm
For characters equal to or below 2047 (hex 0x07FF), the UTF-8 representation is spread across two bytes. The first byte will have the two high bits set and the third bit clear (i.e. 0xC2 to 0xDF). The second byte will have the top bit set and the second bit clear (i.e. 0x80 to 0xBF).
NOTE: I do not want to create the string "\\u03B1" with a backslash. This is different than "\u03B1" which is an escaped unicode character.
It appears that even the most recent C and C++ standards are a bit disappointing in their handling of Unicode.
For those who are confused about the example in the question, like I was:
const char *alpha = "\u03B1"
In C99, this will store a pointer to the string "α" (U+03B1) in alpha. In C89, this is invalid syntax.
I could not find a way to use the \u syntax with a variable or integer constant, as the question requests. You may be better off using a library to add better Unicode support to your program. I have not used the ICU library, but it sounds promising.
How to convert a Unicode code point to characters in C++ using ICU?: possibly an answer to your question
Unicode Processing in C++: a related Stack Overflow question
I figured it out.
The first byte holds the upper 5 bits of the code point, selected by the mask 0x7C0 (binary 11111000000), and the second byte holds the lower 6 bits, selected by the mask 0x3F (binary 00000111111).
The first byte is combined with 0xC0 (binary 11000000) to set the two high bits, and the second byte with 0x80 (binary 10000000) to set the top bit.
int alpha = 0x03B1; // 945
char byte1 = 0xc0 | ((alpha & 0x7c0) >> 6); // 206
char byte2 = 0x80 | (alpha & 0x3f); // 177
printf("%c%c", byte1, byte2);

Unexpected results when looking at ASCII codes in C++

The bit of code below is extracting ASCII codes from characters.
When I convert characters in the normal ASCII region I get the value I expect.
When I convert £ and € from the extended region, I get a load of 1's padding the int that I'm storing the character in.
e.g. the output of the below is:
45 (ASCII 'E', as expected)
FFFFFF80 (the € from the extended range, as expected, but padded with ones)
It's not causing me an issue but I'm just wondering why this happens.
Here's the code...
unsigned int asciichar[3];
string cTextToEncode = "E€";
for (unsigned int i = 0; i < cTextToEncode.length(); i++)
{
asciichar[i] = (unsigned int)cTextToEncode[i];
cout << hex << asciichar[i] << "\n";
}
Can anyone explain why this is?
Thanks
Depending on the implementation, a char can be either signed or unsigned. In your case it appears to be signed, so 0x80 is interpreted as -128 instead of 128; hence, when cast to an integer, it becomes 0xFFFFFF80.
By the way, this has nothing at all to do with ASCII.
First, there's no € in ASCII (extended or otherwise) because the euro didn't exist when ASCII was created. However, several ASCII-friendly 8-bit encodings do support the € character, but the conversion is done by your source code editor (the compiler merely sees a byte which happens to represent € in your editor, but might be something else entirely on, say, a computer in Israel).
Second, (unsigned int) casts do not extract the ASCII encoding of a character. They merely convert the value of the underlying numeric char type to an unsigned integer. This causes strange things to happen when the converted value is negative - on your compiler, char happens to be signed char and thus characters with an ASCII value larger than 127 end up being negative char values.
You should convert to an unsigned char first, and then to an unsigned int.
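A quick illustration of why the two-step conversion matters (assuming char is signed on your compiler):
#include <iostream>
int main()
{
    char c = static_cast<char>(0x80); // € in Windows-1252; negative when char is signed
    std::cout << std::hex
              << (unsigned int)c << "\n"                 // ffffff80 (sign-extended)
              << (unsigned int)(unsigned char)c << "\n"; // 80
}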
You should be careful when promoting signed values.
When promoting a signed char to a signed int, the first bit (the sign bit) is taken into account. The algorithm roughly looks like this:
1) If you have 1X-XX-XX-XX (the char in binary, X being any binary digit), the int will be 1...1-1X-XX-XX-XX (binary, starting with 24 ones) -> 0xFFFFFFYY (hex).
2) If you have 0X-XX-XX-XX (binary), you'll get 0...0-0X-XX-XX-XX (binary, starting with 24 zeros) -> 0x000000YY (hex).
In your case you want to force rule #2 all the time. To do that, you need to tell the compiler to ignore the first bit (the sign bit), which is what unsigned char does.