I need to convert from a byte position in a UTF-8 string to the corresponding character position in Objective-C. I'm sure there must be a library to do this, but I cannot find one - does anyone know of one? (Though obviously any C or C++ library would do the job here.)
I realise that I could truncate the UTF-8 string at the required character, convert that to an NSString, then read the length of the NSString to get my answer, but that seems like a somewhat hacky solution to a problem that can be solved quite simply with a small FSM in C.
Thanks for your help.
"Character" is a somewhat ambiguous term, it means something different in different contexts. I'm guessing that you want the same result as your example, [NSString length].
The NSString documentation isn't exactly upfront about this, but [NSString length] counts the number of UTF-16 code units in the string. So U+0000..U+FFFF count as one each, but U+10000..U+10FFFF count as two each. And don't split surrogate pairs!
You can count the number of UTF-16 code units based on the leading byte of each UTF-8 sequence. The trailing bytes use a disjoint set of values, so you don't need to track any state at all, except your position in the string (good news: a finite state machine is overkill).
static const unsigned char BYTE_WIDTHS[256] = {
    // 1-byte: 0xxxxxxx
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    // Trailing: 10xxxxxx
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    // 2-byte leading: 110xxxxx
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    // 3-byte leading: 1110xxxx
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    // 4-byte leading: 11110xxx
    2,2,2,2,2,2,2,2,
    // invalid: 11111xxx
    0,0,0,0,0,0,0,0
};

size_t utf8_utf16width(const unsigned char *string, size_t len)
{
    size_t i, utf16len = 0;
    for (i = 0; i < len; i++)
        utf16len += BYTE_WIDTHS[string[i]];
    return utf16len;
}
The table is 1 for the 1-byte, 2-byte, and 3-byte UTF-8 leading bytes, and 2 for the 4-byte UTF-8 leading bytes, because those code points will end up as two UTF-16 code units (a surrogate pair) when translated to an NSString.
I generated the table in Haskell with:
elems $ listArray (0,255) (repeat 0) //
[(n,1) | n <- ([0x00..0x7f] ++ [0xc0..0xdf] ++ [0xe0..0xef])] //
[(n,2) | n <- [0xf0..0xf7]]
Look at the UTF-8 encoding and note that code points begin with the following 8-bit patterns:
76543210 <- bit
0xxxxxxx <- ASCII chars
110xxxxx \
1110xxxx } <- more byte(s) (of form 10xxxxxx) follow
11110xxx /
That's what you should look for when searching for the beginning of a code point.
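As a rough sketch of that idea (assuming well-formed UTF-8, and counting code points rather than NSString-style UTF-16 code units or user-perceived characters; utf8_codepoint_index is just a name I made up), converting a byte offset to a code point index only needs that leading-byte test:

#include <cstddef>

// Count code-point starts in the first byte_pos bytes of a well-formed
// UTF-8 string: every byte except the 10xxxxxx trailing bytes starts one.
std::size_t utf8_codepoint_index(const unsigned char *s, std::size_t byte_pos)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < byte_pos; ++i)
        if ((s[i] & 0xC0) != 0x80)   // not a trailing byte
            ++count;
    return count;
}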
But that alone is only part of the solution. You also need to take combining characters into account: combining diacritical marks belong with the base character that precedes them, so you cannot simply separate them and treat them as independent characters.
There's probably even more to it.
I'm using a piece of code (found elsewhere on this site) that checks endianness at runtime.
static bool isLittleEndian()
{
    short int number = 0x1;
    char *numPtr = (char*)&number;
    std::cout << numPtr << std::endl;
    std::cout << *numPtr << std::endl;
    return (numPtr[0] == 1);
}
When in debug mode, the value numPtr looks like this: 0x7fffffffe6ee "\001"
I assume the first hexadecimal part is the pointer's memory address, and the second part is the value it holds. I know that \0 is null termination in old-style C++, but why is it at the front? Is it to do with endianness?
On a little-endian machine: is 01 the first byte and therefore least significant (byte place 0), and \0 the second/final byte (byte place 1)?
In addition, the cout statements do not print the pointer address or its value. Reasons for this?
The others have given you a clear answer to what the "\001" means, so this is an answer to your question:
On a little-endian machine: is 01 the first byte and therefore least significant (byte place 0), and \0 the second/final byte (byte place 1)?
Yes, this is correct. If you look at a value like 0x1234, it consists of two bytes: the high part 0x12 and the low part 0x34. The term "little endian" means that the low part is stored first in memory:
addr: 0x34
addr+1: 0x12
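If you want to see this outside the debugger, here's a minimal sketch (my own illustration, not code from the question) that prints the bytes of 0x1234 in memory order:

#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    std::uint16_t value = 0x1234;
    unsigned char bytes[sizeof value];
    std::memcpy(bytes, &value, sizeof value);   // copy out the object representation

    // Little-endian machines print "34 12"; big-endian machines print "12 34".
    std::printf("%02x %02x\n", bytes[0], bytes[1]);
    return 0;
}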
Did you know that the term "endian" predates the computer industry? It was originally used by Jonathan Swift in his book Gulliver's Travels, where it described whether people ate their eggs from the pointy end or the round end.
The easiest way to check for endianness is to let the system do it for you (htonl is declared in <arpa/inet.h> on POSIX systems):
if (htonl(0xFFFF0000)==0xFFFF0000) printf("Big endian");
else printf("Little endian");
That's not a \0 followed by "01", it's the single character \001, which represents the number 1 in octal. That's the only byte "in" your string. There's another byte after it with the value zero, but you don't see that since it's treated as the string terminator.
For starters: this type of function is totally worthless: on a machine where sizeof(int) is 4, there are 24 possible byte orders. Most, of course, don't make sense, but I've seen at least three. And endianness isn't the only thing which affects integer representation. If you have an int, and you want to get the low order 8 bits, use intValue & 0xFF; for the next 8 bits, (intValue >> 8) & 0xFF.
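For instance, a small sketch of that shift-and-mask approach (the value is only for illustration); it gives the same bytes regardless of the machine's byte order:

#include <cstdint>
#include <cstdio>

int main()
{
    std::uint32_t intValue = 0x12345678;

    unsigned low  = intValue & 0xFF;          // low-order 8 bits:  0x78
    unsigned next = (intValue >> 8) & 0xFF;   // next 8 bits:       0x56

    std::printf("%02x %02x\n", low, next);    // prints "78 56" on any byte order
    return 0;
}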
With regards to your precise question: I presume what you are describing as "looks like this" is what you see in the debugger, when you break at the return. In this case, numPtr is a char* (an unsigned char const* would make more sense), so the debugger assumes a C style string. The 0x7fffffffe6ee is the address; what follows is what the debugger sees as a C style string, which it displays as a string, i.e. "...".

Presumably, your platform is a traditional little-endian machine (Intel); through the pointer, the C style string is the byte sequence (numeric values) 1, 0. The 0 is of course the equivalent of '\0', so the debugger considers this a one character string, with that one character having the encoding 1. There is no printable character with an encoding of one, and it doesn't correspond to any of the normal escape sequences (e.g. '\n', '\t', etc.) either. So the debugger outputs it using an octal escape sequence: a '\' followed by 1 to 3 octal digits. (The traditional '\0' is just a special case of this: a '\' followed by a single octal digit.) And it outputs 3 digits, because (probably) it doesn't want to look ahead to ensure that the next character isn't an octal digit. (If the sequence were the two bytes 1, 49, for example, 49 is '1' in the usual encodings, and if it output only a single digit for the octal encoding of 1, the result would be "\11", which is a single character string, corresponding in the usual encodings to '\t'.) So what you see is: a " to open the string, \001 for the one character with encoding 1 (which has no displayable representation), and a " to mark the end of the string.
The "\001" you are seeing is just one byte. It's probably octal notation, which needs three digits to properly express the (decimal) values 0 to 255.
The \0 isn't a NUL, the debugger is showing you numPtr as a string, the first character of which is \001 or control-A in ASCII. The second character is \000, which isn't displayed because NULs aren't shown when displaying strings. The two character string version of 'number' would appear as "\000\001" on a big-endian machine, instead of "\001\000" as it appears on little-endian machines.
In addition, the cout statements do not print the pointer address or its value. Reasons for this?
Because chars and char pointers are treated differently than integers when it comes to printing.
When you print a char, it prints the character from whatever character set is being used. Usually, this is ASCII, or some superset of ASCII. The value 0x1 in ASCII is non-printing.
When you print a char pointer, it doesn't print the address, it prints it as a null-terminated string.
To get the results you desire, cast your char pointer to a void pointer, and cast your char to an int.
std::cout << (void*)numPtr << std::endl;
std::cout << (int)*numPtr << std::endl;
How do you count unicode characters in a UTF-8 file in C++? Perhaps someone would be so kind as to show me a "stand alone" method, or alternatively, a short example using http://icu-project.org/index.html.
EDIT: An important caveat is that I need to build counts of each character, so it's not like I'm counting the total number of characters, but the number of occurrences of a set of characters.
In UTF-8, a non-leading byte always has the top two bits set to 10, so just ignore all such bytes. If you don't mind extra complexity, you can do more than that (to skip ahead across non-leading bytes based on the bit pattern of a leading byte) but in reality, it's unlikely to make much difference except for short strings (because you'll typically be close to the memory bandwidth anyway).
Edit: I originally mis-read your question as simply asking about how to count the length of a string of characters encoded in UTF-8. If you want to count character frequencies, you probably want to convert those to UTF-32/UCS-4, then you'll need some sort of sparse array to count the frequencies.
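If you do roll that yourself, a minimal sketch (assuming well-formed UTF-8 and ignoring normalization; count_code_points is just a name I made up) could decode each sequence to a code point and tally it in a sparse map:

#include <cstddef>
#include <string>
#include <unordered_map>

std::unordered_map<char32_t, std::size_t> count_code_points(const std::string &utf8)
{
    std::unordered_map<char32_t, std::size_t> counts;
    for (std::size_t i = 0; i < utf8.size(); )
    {
        unsigned char b = utf8[i];
        char32_t cp;
        std::size_t len;
        if      (b < 0x80) { cp = b;        len = 1; }   // 0xxxxxxx
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; }   // 110xxxxx
        else if (b < 0xF0) { cp = b & 0x0F; len = 3; }   // 1110xxxx
        else               { cp = b & 0x07; len = 4; }   // 11110xxx
        for (std::size_t j = 1; j < len && i + j < utf8.size(); ++j)
            cp = (cp << 6) | (utf8[i + j] & 0x3F);       // fold in trailing bytes
        ++counts[cp];
        i += len;
    }
    return counts;
}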
The hard part of this deals with counting code points vs. characters. For example, consider the character "À" -- the "Latin capital letter A with grave". There are at least two different ways to produce this character. You can use codepoint U+00C0, which encodes the whole thing in a single code point, or you can use codepoint U+0041 (Latin capital letter A) followed by codepoint U+0300 (Combining grave accent).
Normalizing (with respect to Unicode) means turning all such characters into the same form. You can either combine them all into a single code point, or separate them all into separate code points. For your purposes, it's probably easier to combine them into a single code point whenever possible. Writing this on your own probably isn't very practical -- I'd use the normalizer API from the ICU project.
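A hedged sketch of that route with ICU4C (assuming a reasonably recent ICU that provides Normalizer2::getNFCInstance; error handling omitted):

#include <string>
#include <unicode/normalizer2.h>
#include <unicode/unistr.h>

// NFC-normalize a UTF-8 string so that "A" + combining grave accent and the
// precomposed "À" end up as the same code point before you count anything.
std::string normalize_nfc(const std::string &utf8)
{
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2 *nfc = icu::Normalizer2::getNFCInstance(status);
    icu::UnicodeString composed =
        nfc->normalize(icu::UnicodeString::fromUTF8(utf8.c_str()), status);

    std::string result;
    composed.toUTF8String(result);   // back to UTF-8
    return result;
}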
If you know the UTF-8 sequence is well formed, it's quite easy. Count up each byte that starts with a zero bit or with two one bits. The first condition will catch every code point that is represented by a single byte, the second will catch the first byte of each multi-byte sequence.
while (*p != 0)
{
    if ((*p & 0x80) == 0 || (*p & 0xc0) == 0xc0)
        ++count;
    ++p;
}
Or alternatively as remarked in the comments, you can simply skip every byte that's a continuation:
while (*p != 0)
{
    if ((*p & 0xc0) != 0x80)
        ++count;
    ++p;
}
Or if you want to be super clever and make it a 2-liner:
for (; *p != 0; ++p)
    count += ((*p & 0xc0) != 0x80);
The Wikipedia page for UTF-8 clearly shows the patterns.
A discussion with a full routine written in C++ is at http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html
I know it's late for this thread, but it could help. With ICU, I did it like this:
std::string theString = "blabla";
icu::UnicodeString uStr = icu::UnicodeString::fromUTF8(theString.c_str());
std::cout << "length = " << uStr.length() << std::endl;  // note: length() counts UTF-16 code units
I wouldn't consider this a language-centric question. The UTF-8 format is fairly simple; decoding it from a file should be only a few lines of code in any language.
open file
until eof
    if file.readchar & 0xC0 != 0x80
        increment count
close file
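For completeness, here's one way that pseudocode might look in C++ (a sketch; the file name argument is just for illustration):

#include <cstddef>
#include <fstream>
#include <iostream>

int main(int argc, char *argv[])
{
    if (argc < 2)
        return 1;

    std::ifstream file(argv[1], std::ios::binary);
    std::size_t count = 0;
    char c;
    while (file.get(c))
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80)   // skip continuation bytes
            ++count;

    std::cout << count << '\n';
    return 0;
}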