Get "char" of a multi-byte character in linux/mac - c++

I have a std::string with UTF-8 characters (some Latin, some non-Latin) on Linux and Mac.
As we know, UTF-8 characters have no fixed size, and some of them occupy more than one byte (unlike regular Latin characters).
The question is: how can I get the character at offset i?
It makes sense to use an int32 data type to store the character, but how do I get it?
For example:
std::string str = read_utf8_text();
int c_can_be_more_than_one_byte = str[i]; // <-- obviously this code is wrong
It is important to point out that I do not know the size of the character at offset i.

It's very simple.
First, you have to understand that you can't calculate the position without iterating over the string (that's obvious for variable-length characters).
Second, remember that in UTF-8 a character can be 1-4 bytes long, and when it occupies more than one byte, every trailing byte has 10 in its two most significant bits. So you just count lead bytes, skipping a byte whenever (byte_val & 0xC0) == 0x80.
Unfortunately, I don't have a compiler at my disposal right now, so please be forgiving of possible mistakes in the code:
int desired_index = 19;
int index = 0;
const char* p = my_str.c_str();   // note: c_str() returns a const char*
while ( *p && index < desired_index ){
    if ( (*p & 0xC0) != 0x80 )    // if it is the first byte of the next character
        index++;
    p++;
}
// now p may point to a trailing (2nd-4th) byte of the previous character, skip them
while ( (*p & 0xC0) == 0x80 )
    p++;
if ( *p ){
    // here p points to your desired character
} else {
    // we reached the end of the string while searching
}
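If you also want that character's code point as a 32-bit value (as the question suggests), a minimal decoding sketch could follow the loop above; the helper name decode_codepoint_at is made up for illustration, and it assumes p already points at a valid lead byte:
#include <cstdint>

// Decode the UTF-8 sequence starting at p into a single code point.
// Assumes p points at a valid lead byte, as located by the loop above.
uint32_t decode_codepoint_at(const char* p) {
    unsigned char lead = static_cast<unsigned char>(*p);
    if (lead < 0x80) return lead;                             // 1-byte (plain ASCII)
    int extra = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : 1;  // number of continuation bytes
    uint32_t cp = lead & (0x3F >> extra);                     // payload bits of the lead byte
    for (int i = 0; i < extra; ++i)
        cp = (cp << 6) | (static_cast<unsigned char>(*++p) & 0x3F);
    return cp;
}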

Related

What does this char string related piece of C++ code do?

bool check(const char *text) {
    char c;
    while (c = *text++) {
        if ((c & 0x80) && ((*text) & 0x80)) {
            return true;
        }
    }
    return false;
}
What's 0x80, and what does the whole mysterious function do?
Testing the result of an x & 0x80 expression for non-zero (as is done twice in the code you show) checks if the most significant bit (bit 7) of the char operand (x) is set¹. In your case, the code loops through the given string looking for two consecutive characters (c, which is a copy of the 'current' character, and *text, the next one) with that bit set.
If such a combination is found, the function returns true; if it is not found and the loop reaches the nul terminator (so that the c = *text++ expression becomes zero), it returns false.
As to why it does such a check – I can only guess but, if that upper bit is set, then the character will not be a standard ASCII value (and may be the first of a Unicode pair, or some other multi-byte character representation).
Possibly helpful references:
Bitwise operators
Hexadecimal constants
¹ Note that this bitwise AND test is really the only safe way to check that bit, because the C++ Standard allows the char type to be either signed (where testing for a negative value would be an alternative) or unsigned (where testing for >= 128 would be required); either of those tests would fail if the implementation's char had the 'wrong' type of signedness.
I can't be totally sure without more context, but it looks to me like this function checks to see if a string contains any UTF-8 characters outside the classic 7-bit US-ASCII range.
while (c=*text++) will loop until it finds the nul-terminator in a C-style string; assigning each char to c as it goes. c & 0x80 checks if the most-significant-bit of c is set. *text & 0x80 does the same for the char pointed to by text (which will be the one after c, since it was incremented as part of the while condition).
Thus this function will return true if any two adjacent chars in the string pointed to by text have their most-significant-bit set. That's the case for any code points U+0080 and above in UTF-8; hence my guess that this function is for detecting UTF-8 text.
Rewriting to be less compact:
while (true)
{
    char c = *text;
    text += 1;
    if (c == '\0')                 // at the end of the string?
        return false;
    int temp1 = c & 0x80;          // test the MSB of c
    int temp2 = (*text) & 0x80;    // test the MSB of the next character
    if (temp1 != 0 && temp2 != 0)  // if both are set, return true
        return true;
}
MSB means Most Significant Bit (bit 7); it is zero for plain ASCII characters.
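A quick, hypothetical test of check() (the strings are my own examples, not from the original post): plain ASCII has no byte with bit 7 set, while the UTF-8 encoding of é is the two bytes 0xC3 0xA9, both with the MSB set:
#include <cstdio>

int main() {
    std::printf("%d\n", check("hello"));         // prints 0: pure 7-bit ASCII
    std::printf("%d\n", check("h\xC3\xA9llo"));  // prints 1: two consecutive bytes with the MSB set
    return 0;
}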

std::string optimal way to truncate utf-8 at safe place

I have a valid UTF-8 encoded string in a std::string, and a limit in bytes. I would like to truncate the string and append "..." at MAX_SIZE - 3 - x, where x is whatever offset is needed to avoid cutting a UTF-8 character in half.
Is there a function that could determine x based on MAX_SIZE without having to start from the beginning of the string?
If you have a location in a string, and you want to go backwards to find the start of a UTF-8 character (and therefore a valid place to cut), this is fairly easily done.
You start from the last byte in the sequence. If the top two bits of the last byte are 10, then it is part of a UTF-8 sequence, so keep backing up until the top two bits are not 10 (or until you reach the start).
The way UTF-8 works is that a byte can be one of three things, based on the upper bits of the byte. If the topmost bit is 0, then the byte is an ASCII character, and the next 7 bits are the Unicode Codepoint value itself. If the topmost bit is 10, then the 6 bits that follow are extra bits for a multi-byte sequence. But the beginning of a multibyte sequence is coded with 11 in the top 2 bits.
So if the top bits of a byte are not 10, then it's either an ASCII character or the start of a multibyte sequence. Either way, it's a valid place to cut.
Note however that, while this algorithm will break the string at codepoint boundaries, it ignores Unicode grapheme clusters. This means that combining characters can be cut away from the base characters they combine with; accents can be lost, for example. Doing proper grapheme cluster analysis would require having access to the Unicode table that says whether a codepoint is a combining character.
But it will at least be a valid Unicode UTF-8 string. So that's better than most people do ;)
The code would look something like this (in C++14):
#include <cassert>
#include <string>

auto FindCutPosition(const std::string &str, size_t max_size)
{
    assert(str.size() >= max_size && "Make sure stupidity hasn't happened.");
    assert(max_size > 3 && "Make sure stupidity hasn't happened.");
    max_size -= 3;
    for(size_t pos = max_size; pos > 0; --pos)
    {
        unsigned char byte = static_cast<unsigned char>(str[pos]); //Perfectly valid
        if((byte & 0xC0) != 0x80) //not a continuation byte, so a valid place to cut
            return pos;
    }
    unsigned char byte = static_cast<unsigned char>(str[0]); //Perfectly valid
    if((byte & 0xC0) != 0x80)
        return size_t{0};
    //If your first byte isn't even a valid UTF-8 starting point, then something terrible has happened.
    throw bad_utf8_encoded_text(...); //some user-defined exception type, not shown here
}
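A hypothetical usage sketch, assuming the FindCutPosition above (with some exception type defined for the final throw): truncate to at most 10 bytes, then append the "...":
#include <iostream>
#include <string>

int main() {
    // "\xE3\x82\xA2" is the 3-byte UTF-8 encoding of ア, so the string is 13 bytes long.
    std::string s = "abcd\xE3\x82\xA2\xE3\x82\xA2xyz";
    auto cut = FindCutPosition(s, 10);             // largest safe cut at or below 10 - 3 bytes
    std::cout << s.substr(0, cut) + "..." << '\n'; // prints "abcdア..." (exactly 10 bytes)
    return 0;
}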

How can I get a UTF-8 character's number from its binary representation in C++?

For example, I have: 11100011 10000010 10100010. It is the binary of: ア;
its number in UTF-8 is: 12450
How can I get this number from binary?
The byte sequence you're showing is the UTF-8 encoded version of the character.
You need to decode the UTF-8 to get to the Unicode code point.
For this exact sequence of bytes, the following bits make up the code point:
11100011 10000010 10100010
    ****   ******   ******
So, concatenating the asterisked bits we get the number 0011000010100010, which equals 0x30a2 or 12450 in decimal.
See the Wikipedia description for details on how to interpret the encoding.
In a nutshell: if bit 7 is set in the first byte, the number of adjacent high bits that are also set (call it m; here m = 2) gives the number of bytes that follow for this code point. The number of bits to extract is (8 - 1 - 1 - m) from the first byte and 6 from each subsequent byte. So here we get (8 - 1 - 1 - 2) = 4 bits from the first byte plus 2 × 6 = 12 bits from the continuation bytes, 16 bits in total.
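As a concrete illustration of that arithmetic (a small sketch of my own, not from the original answer), applied to the bytes 0xE3 0x82 0xA2:
#include <cstdio>

int main() {
    unsigned char b0 = 0xE3, b1 = 0x82, b2 = 0xA2;  // the UTF-8 bytes of ア
    unsigned cp = ((b0 & 0x0Fu) << 12)              // 4 payload bits from the lead byte
                | ((b1 & 0x3Fu) << 6)               // 6 bits from the first continuation byte
                |  (b2 & 0x3Fu);                    // 6 bits from the second continuation byte
    std::printf("U+%04X = %u\n", cp, cp);           // prints "U+30A2 = 12450"
    return 0;
}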
As pointed out in comments, there are plenty of libraries for this, so you might not need to implement it yourself.
Working from the Wikipedia page, I came up with this:
#include <stdexcept>

// unicode_error is assumed to be a user-defined exception type; a minimal definition:
struct unicode_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

unsigned utf8_to_codepoint(const char* ptr) {
    unsigned char lead = static_cast<unsigned char>(*ptr);   // work with unsigned byte values
    if (lead < 0x80) return lead;                             // single-byte (ASCII) character
    if (lead < 0xC0) throw unicode_error("invalid utf8 lead byte");
    unsigned result = 0;
    int shift = 0;
    if      (lead < 0xE0) { result = lead & 0x1F; shift = 1; } // 2-byte sequence
    else if (lead < 0xF0) { result = lead & 0x0F; shift = 2; } // 3-byte sequence
    else if (lead < 0xF8) { result = lead & 0x07; shift = 3; } // 4-byte sequence
    else throw unicode_error("invalid utf8 lead byte");
    for (; shift > 0; --shift) {
        unsigned char cont = static_cast<unsigned char>(*++ptr);
        if (cont < 0x80 || cont >= 0xC0)
            throw unicode_error("invalid utf8 continuation byte");
        result <<= 6;
        result |= cont & 0x3F;                                 // keep the low 6 payload bits
    }
    return result;
}
Note that this is a bare-bones implementation: it still accepts plenty of invalid sequences that it probably shouldn't (overlong encodings, surrogate code points, values above U+10FFFF). I put this up merely to show that it's a lot harder than you'd think, and that you should use a good Unicode library.
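For completeness, a hypothetical call to the function above on the byte sequence from the question (assuming the sketch compiles as shown):
#include <iostream>

int main() {
    const char* katakana_a = "\xE3\x82\xA2";            // UTF-8 encoding of ア
    std::cout << utf8_to_codepoint(katakana_a) << '\n'; // prints 12450
    return 0;
}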

C++ any hex to char

Is there a better way to convert from hex to char?
char one = static_cast<char>(0x01);
(asking because of this --> C++ using pointers, nothing is passed )
Also, is there a fast way to make a char array out of hex values (e.g. 0x12345678 to a char array)?
You can try this:
std::string hexify(unsigned int n)
{
    std::string res;
    do
    {
        res += "0123456789ABCDEF"[n % 16];
        n >>= 4;
    } while (n);
    return std::string(res.rbegin(), res.rend());
}
Credits to STL for the "index into char array" trick.
Also beware when printing chars, which are signed on some platforms. If you want a char holding 0x80 to print as 80 rather than FFFFFF80, you have to prevent it from being sign-extended to a negative value by converting to unsigned char first: hexify((unsigned char)(c));
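For example, a small usage sketch (assuming the hexify above is in scope):
#include <iostream>
#include <string>

int main() {
    std::cout << hexify(0x12345678u) << '\n';      // prints "12345678"
    char c = static_cast<char>(0x80);              // negative on platforms where char is signed
    std::cout << hexify((unsigned char)c) << '\n'; // prints "80" thanks to the unsigned char cast
    return 0;
}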
What do you intend to be stored in the variable one?
The code as written will store the ASCII character 0x01 into one. This is a control character, not a printable character. If you're looking for the digit 1, then you need to say so explicitly:
char one = '1';
That stores the actual character, not the ASCII code 0x01.
If you are trying to convert a number into the string representation of that number, then you need to use one of the standard number-to-string conversion mechanisms. If instead you are trying to treat a 32-bit integer as a sequence of 4 bytes, each of which is an ASCII character, that is a different matter. For that, you could do this:
uint32_t someNumber = 0x12345678;
std::string myString(4, ' ');
myString[0] = static_cast<char>((someNumber >> 24) & 0xFF);
myString[1] = static_cast<char>((someNumber >> 16) & 0xFF);
myString[2] = static_cast<char>((someNumber >> 8) & 0xFF);
myString[3] = static_cast<char>((someNumber) & 0xFF);
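Going the other way, a hypothetical sketch that reassembles the 32-bit value from a 4-byte big-endian string like myString above (the name bytesToUint32 is made up for illustration):
#include <cstdint>
#include <string>

// Rebuild the integer from the four bytes; the unsigned char casts avoid sign extension.
uint32_t bytesToUint32(const std::string& s) {
    return (static_cast<uint32_t>(static_cast<unsigned char>(s[0])) << 24) |
           (static_cast<uint32_t>(static_cast<unsigned char>(s[1])) << 16) |
           (static_cast<uint32_t>(static_cast<unsigned char>(s[2])) << 8)  |
            static_cast<uint32_t>(static_cast<unsigned char>(s[3]));
}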

Extra numbers being appended to string, don't know why?

Number(string binary)
{
    int raw_string_int[size];
    char raw_string_char[size];
    strcpy(raw_string_char, binary.c_str());
    printf("Raw String is %s", raw_string_char);
    for (int i = 0; i < size; i++)
    {
        raw_string_int[i] = int(raw_string_char[i]);
        printf("%i\n", int(raw_string_char[i]));
        if (raw_string_int[i] != 0 || raw_string_int[i] != 1)
        {
            printf("ERROR NOT A BINARY NUMBER\n");
            exit(0);
        }
    }
Hi, I'm entering 0001 as binary at the command prompt, but raw_string_char ends up with two extra numbers appended. Can anyone explain why this is? Is the carriage return being read in as a char?
Here is what I'm getting at the command prompt:
./test
0001
Raw String is 000148
ERROR NOT A BINARY NUMBER
You forgot the "\n" in your first printf. The 48 is from the second printf, and is the result of casting the first '0' (ASCII 0x30 = 48) to an int.
To convert a textual 0 or 1 to the corresponding integer, you need to subtract 0x30.
Your assumption that char('0') == int(0) and char('1') == int(1) just doesn't hold. In ASCII these characters have the values of 48 and 49.
What you should do to get the integer values of the digit characters is subtract '0' instead of simply casting (raw_string_int[x] = raw_string_char[x] - '0';).
I think you have conceptual problems though. The array can't be full of valid values to the end (the corresponding C-string would at least contain a null-terminator, which is not a valid binary character). You can use the string's size() method to find out how many characters the string actually contains. And naturally you are risking buffer overflows, should the binary string contain size characters or more.
If the intention is to check if the input is a valid binary number, why can't you test the original string, why would you copy data around to two more arrays?
You're printing every character in raw_string_char, but C-style strings only run until the first zero byte (the terminating '\0' character, not the digit '0').
Change the loop to for (int i = 0; i < size && raw_string_char[i] != '\0'; i++).
Like others said, '0' is converted to an integer 48. You don't really need to convert the C++ string to a C style string. You can use iterators or the index operator [] on the C++ string. You also need to use logical AND && rather than logical OR || in your if statement.
#include <cstdio>
#include <string>

void Number(std::string binary) {
    for (std::string::const_iterator i = binary.begin(); i != binary.end(); i++)
        if (*i != '0' && *i != '1')
        {
            printf("ERROR NOT A BINARY NUMBER\n");
            return;
        }
}

int main() {
    Number("0001");
}
raw_string_char is never initialized; the extra characters are possibly due to this. Use memset to initialize the array:
memset(raw_string_char, 0, size);
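Putting the answers together, a corrected sketch of the routine might look like this (my own consolidation, not from any single answer): it validates the std::string directly, uses && rather than ||, compares against the characters '0' and '1', and only subtracts '0' when the integer digit is actually needed:
#include <cstdio>
#include <string>
#include <vector>

void Number(const std::string& binary) {
    printf("Raw String is %s\n", binary.c_str());    // note the '\n' that was missing
    std::vector<int> digits;
    for (std::string::size_type i = 0; i < binary.size(); i++) {
        if (binary[i] != '0' && binary[i] != '1') {  // && here, not ||
            printf("ERROR NOT A BINARY NUMBER\n");
            return;
        }
        digits.push_back(binary[i] - '0');           // '0'/'1' -> 0/1
    }
}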