I need to convert from a byte position in a UTF-8 string to the corresponding character position in Objective-C. I'm sure there must be a library to do this, but I cannot find one - does anyone know of one? (Though obviously any C or C++ library would do the job here.)
I realise that I could truncate the UTF-8 string at the required character, convert that to an NSString, then read the length of the NSString to get my answer, but that seems like a somewhat hacky solution to a problem that can be solved quite simply with a small FSM in C.
Thanks for your help.
"Character" is a somewhat ambiguous term, it means something different in different contexts. I'm guessing that you want the same result as your example, [NSString length].
The NSString documentation isn't exactly upfront about this, but [NSString length] counts the number of UTF-16 code units in the string. So U+0000..U+FFFF count as one each, but U+10000..U+10FFFF count as two each. And don't split surrogate pairs!
You can count the number of UTF-16 code units based on the leading byte of each UTF-8 sequence. The trailing bytes use a disjoint set of values, so you don't need to track any state at all, except your position in the string (good news: a finite state machine is overkill).
static const unsigned char BYTE_WIDTHS[256] = {
    // 1-byte: 0xxxxxxx
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    // Trailing: 10xxxxxx
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    // 2-byte leading: 110xxxxx
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    // 3-byte leading: 1110xxxx
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    // 4-byte leading: 11110xxx
    2,2,2,2,2,2,2,2,
    // invalid: 11111xxx
    0,0,0,0,0,0,0,0
};

size_t utf8_utf16width(const unsigned char *string, size_t len)
{
    size_t i, utf16len = 0;
    for (i = 0; i < len; i++)
        utf16len += BYTE_WIDTHS[string[i]];
    return utf16len;
}
The table is 1 for the 1-byte, 2-byte, and 3-byte UTF-8 leading bytes, and 2 for the 4-byte UTF-8 leading bytes, because those code points will end up as two UTF-16 code units (a surrogate pair) when translated to an NSString.
I generated the table in Haskell with:
elems $ listArray (0,255) (repeat 0) //
[(n,1) | n <- ([0x00..0x7f] ++ [0xc0..0xdf] ++ [0xe0..0xef])] //
[(n,2) | n <- [0xf0..0xf7]]
Look at the UTF-8 encoding and note that code points begin with the following 8-bit patterns:
76543210 <- bit
0xxxxxxx <- ASCII chars
110xxxxx \
1110xxxx } <- more byte(s) (of form 10xxxxxx) follow
11110xxx /
That's what you should look for when searching for the beginning of a code point.
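As a rough sketch of that idea (assuming well-formed UTF-8, and counting code points rather than NSString-style UTF-16 code units or user-perceived characters; utf8_codepoint_index is just a name I made up), converting a byte offset to a code point index only needs that leading-byte test:

#include <cstddef>

// Count code-point starts in the first byte_pos bytes of a well-formed
// UTF-8 string: every byte except the 10xxxxxx trailing bytes starts one.
std::size_t utf8_codepoint_index(const unsigned char *s, std::size_t byte_pos)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < byte_pos; ++i)
        if ((s[i] & 0xC0) != 0x80)   // not a trailing byte
            ++count;
    return count;
}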
But that alone is only part of the solution. You also need to take combining characters into account: combining diacritical marks belong with the base character that precedes them, so you cannot simply separate them and treat them as independent characters.
There's probably even more to it.
I'm using a piece of code (found elsewhere on this site) that checks endianness at runtime.
static bool isLittleEndian()
{
    short int number = 0x1;
    char *numPtr = (char*)&number;
    std::cout << numPtr << std::endl;
    std::cout << *numPtr << std::endl;
    return (numPtr[0] == 1);
}
When in debug mode, the value numPtr looks like this: 0x7fffffffe6ee "\001"
I assume the first hexadecimal part is the pointer's memory address, and the second part is the value it holds. I know that \0 is null termination in old-style C++, but why is it at the front? Is it to do with endianness?
On a little-endian machine: is 01 the first byte and therefore least significant (byte place 0), and \0 the second/final byte (byte place 1)?
In addition, the cout statements do not print the pointer address or its value. Reasons for this?
The others have given you a clear answer to what the "\001" means, so this is an answer to your question:
On a little-endian machine: is 01 the first byte and therefore least significant (byte place 0), and \0 the second/final byte (byte place 1)?
Yes, this is correct. If you look at a value like 0x1234, it consists of two bytes: the high part 0x12 and the low part 0x34. The term "little endian" means that the low part is stored first in memory:
addr: 0x34
addr+1: 0x12
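If you want to see this outside the debugger, here's a minimal sketch (my own illustration, not code from the question) that prints the bytes of 0x1234 in memory order:

#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    std::uint16_t value = 0x1234;
    unsigned char bytes[sizeof value];
    std::memcpy(bytes, &value, sizeof value);   // copy out the object representation

    // Little-endian machines print "34 12"; big-endian machines print "12 34".
    std::printf("%02x %02x\n", bytes[0], bytes[1]);
    return 0;
}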
Did you know that the term "endian" predates the computer industry? It was originally used by Jonathan Swift in his book Gulliver's Travels, where it described whether people ate their eggs from the pointy end or the round end.
The easiest way to check for endianness is to let the system do it for you (htonl is declared in <arpa/inet.h> on POSIX systems):
if (htonl(0xFFFF0000)==0xFFFF0000) printf("Big endian");
else printf("Little endian");
That's not a \0 followed by "01", it's the single character \001, which represents the number 1 in octal. That's the only byte "in" your string. There's another byte after it with the value zero, but you don't see that since it's treated as the string terminator.
For starters: this type of function is totally worthless: on a machine where sizeof(int) is 4, there are 24 possible byte orders. Most, of course, don't make sense, but I've seen at least three. And endianness isn't the only thing which affects integer representation. If you have an int, and you want to get the low order 8 bits, use intValue & 0xFF; for the next 8 bits, (intValue >> 8) & 0xFF.
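For instance, a small sketch of that shift-and-mask approach (the value is only for illustration); it gives the same bytes regardless of the machine's byte order:

#include <cstdint>
#include <cstdio>

int main()
{
    std::uint32_t intValue = 0x12345678;

    unsigned low  = intValue & 0xFF;          // low-order 8 bits:  0x78
    unsigned next = (intValue >> 8) & 0xFF;   // next 8 bits:       0x56

    std::printf("%02x %02x\n", low, next);    // prints "78 56" on any byte order
    return 0;
}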
With regards to your precise question: I presume what you are describing as "looks like this" is what you see in the debugger, when you break at the return. In this case, numPtr is a char* (an unsigned char const* would make more sense), so the debugger assumes a C style string. The 0x7fffffffe6ee is the address; what follows is what the debugger sees as a C style string, which it displays as a string, i.e. "...".

Presumably, your platform is a traditional little-endian machine (Intel); through the pointer, the C style string is the byte sequence (numeric values) 1, 0. The 0 is of course the equivalent of '\0', so the debugger considers this a one character string, with that one character having the encoding 1. There is no printable character with an encoding of one, and it doesn't correspond to any of the normal escape sequences (e.g. '\n', '\t', etc.) either. So the debugger outputs it using an octal escape sequence: a '\' followed by 1 to 3 octal digits. (The traditional '\0' is just a special case of this: a '\' followed by a single octal digit.) And it outputs 3 digits, because (probably) it doesn't want to look ahead to ensure that the next character isn't an octal digit. (If the sequence were the two bytes 1, 49, for example, 49 is '1' in the usual encodings, and if it output only a single digit for the octal encoding of 1, the result would be "\11", which is a single character string, corresponding in the usual encodings to '\t'.) So what you see is: a " to open the string, \001 for the one character with encoding 1 (which has no displayable representation), and a " to mark the end of the string.
The "\001" you are seeing is just one byte. It's probably octal notation, which needs three digits to properly express the (decimal) values 0 to 255.
The \0 isn't a NUL, the debugger is showing you numPtr as a string, the first character of which is \001 or control-A in ASCII. The second character is \000, which isn't displayed because NULs aren't shown when displaying strings. The two character string version of 'number' would appear as "\000\001" on a big-endian machine, instead of "\001\000" as it appears on little-endian machines.
In addition, the cout statements do not print the pointer address or its value. Reasons for this?
Because chars and char pointers are treated differently than integers when it comes to printing.
When you print a char, it prints the character from whatever character set is being used. Usually, this is ASCII, or some superset of ASCII. The value 0x1 in ASCII is non-printing.
When you print a char pointer, it doesn't print the address, it prints it as a null-terminated string.
To get the results you desire, cast your char pointer to a void pointer, and cast your char to an int.
std::cout << (void*)numPtr << std::endl;
std::cout << (int)*numPtr << std::endl;
How do you count unicode characters in a UTF-8 file in C++? Perhaps someone would be so kind as to show me a "stand alone" method, or alternatively, a short example using http://icu-project.org/index.html.
EDIT: An important caveat is that I need to build counts of each character, so it's not like I'm counting the total number of characters, but the number of occurrences of a set of characters.
In UTF-8, a non-leading byte always has the top two bits set to 10, so just ignore all such bytes. If you don't mind extra complexity, you can do more than that (to skip ahead across non-leading bytes based on the bit pattern of a leading byte) but in reality, it's unlikely to make much difference except for short strings (because you'll typically be close to the memory bandwidth anyway).
Edit: I originally mis-read your question as simply asking about how to count the length of a string of characters encoded in UTF-8. If you want to count character frequencies, you probably want to convert those to UTF-32/UCS-4, then you'll need some sort of sparse array to count the frequencies.
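If you do roll that yourself, a minimal sketch (assuming well-formed UTF-8 and ignoring normalization; count_code_points is just a name I made up) could decode each sequence to a code point and tally it in a sparse map:

#include <cstddef>
#include <string>
#include <unordered_map>

std::unordered_map<char32_t, std::size_t> count_code_points(const std::string &utf8)
{
    std::unordered_map<char32_t, std::size_t> counts;
    for (std::size_t i = 0; i < utf8.size(); )
    {
        unsigned char b = utf8[i];
        char32_t cp;
        std::size_t len;
        if      (b < 0x80) { cp = b;        len = 1; }   // 0xxxxxxx
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; }   // 110xxxxx
        else if (b < 0xF0) { cp = b & 0x0F; len = 3; }   // 1110xxxx
        else               { cp = b & 0x07; len = 4; }   // 11110xxx
        for (std::size_t j = 1; j < len && i + j < utf8.size(); ++j)
            cp = (cp << 6) | (utf8[i + j] & 0x3F);       // fold in trailing bytes
        ++counts[cp];
        i += len;
    }
    return counts;
}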
The hard part of this deals with counting code points vs. characters. For example, consider the character "À" -- the "Latin capital letter A with grave". There are at least two different ways to produce this character. You can use codepoint U+00C0, which encodes the whole thing in a single code point, or you can use codepoint U+0041 (Latin capital letter A) followed by codepoint U+0300 (Combining grave accent).
Normalizing (with respect to Unicode) means turning all such characters into the same form. You can either combine them all into a single code point, or separate them all into separate code points. For your purposes, it's probably easier to combine them into a single code point whenever possible. Writing this on your own probably isn't very practical -- I'd use the normalizer API from the ICU project.
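A hedged sketch of that route with ICU4C (assuming a reasonably recent ICU that provides Normalizer2::getNFCInstance; error handling omitted):

#include <string>
#include <unicode/normalizer2.h>
#include <unicode/unistr.h>

// NFC-normalize a UTF-8 string so that "A" + combining grave accent and the
// precomposed "À" end up as the same code point before you count anything.
std::string normalize_nfc(const std::string &utf8)
{
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2 *nfc = icu::Normalizer2::getNFCInstance(status);
    icu::UnicodeString composed =
        nfc->normalize(icu::UnicodeString::fromUTF8(utf8.c_str()), status);

    std::string result;
    composed.toUTF8String(result);   // back to UTF-8
    return result;
}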
If you know the UTF-8 sequence is well formed, it's quite easy. Count up each byte that starts with a zero bit or with two one bits. The first condition will catch every code point that is represented by a single byte, the second will catch the first byte of each multi-byte sequence.
while (*p != 0)
{
    if ((*p & 0x80) == 0 || (*p & 0xc0) == 0xc0)
        ++count;
    ++p;
}
Or alternatively as remarked in the comments, you can simply skip every byte that's a continuation:
while (*p != 0)
{
    if ((*p & 0xc0) != 0x80)
        ++count;
    ++p;
}
Or if you want to be super clever and make it a 2-liner:
for (; *p != 0; ++p)
    count += ((*p & 0xc0) != 0x80);
The Wikipedia page for UTF-8 clearly shows the patterns.
A discussion with a full routine written in C++ is at http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html
I know it's late for this thread, but it could help. With ICU, I did it like this:
std::string theString = "blabla";
icu::UnicodeString uStr = icu::UnicodeString::fromUTF8(theString.c_str());
std::cout << "length = " << uStr.length() << std::endl;  // note: length() counts UTF-16 code units
I wouldn't consider this a language-centric question. The UTF-8 format is fairly simple; decoding it from a file should be only a few lines of code in any language.
open file
until eof
    if file.readchar & 0xC0 != 0x80
        increment count
close file
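For completeness, here's one way that pseudocode might look in C++ (a sketch; the file name argument is just for illustration):

#include <cstddef>
#include <fstream>
#include <iostream>

int main(int argc, char *argv[])
{
    if (argc < 2)
        return 1;

    std::ifstream file(argv[1], std::ios::binary);
    std::size_t count = 0;
    char c;
    while (file.get(c))
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80)   // skip continuation bytes
            ++count;

    std::cout << count << '\n';
    return 0;
}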