I need to convert from a byte position in a UTF-8 string to the corresponding character position in Objective-C. I'm sure there must be a library to do this, but I cannot find one - does anyone know of one? (Though obviously any C or C++ library would do the job here.)
I realise that I could truncate the UTF-8 string at the required character, convert that to an NSString, then read the length of the NSString to get my answer, but that seems like a somewhat hacky solution to a problem that can be solved quite simply with a small FSM in C.
Thanks for your help.
"Character" is a somewhat ambiguous term, it means something different in different contexts. I'm guessing that you want the same result as your example, [NSString length].
The NSString documentation isn't exactly upfront about this, but [NSString length] counts the number of UTF-16 code units in the string. So U+0000..U+FFFF count as one each, but U+10000..U+10FFFF count as two each. And don't split surrogate pairs!
You can count the number of UTF-16 code units based on the leading byte of each UTF-8 sequence. The trailing bytes use a disjoint range of values, so you don't need to track any state at all except your position in the string (good news: a finite state machine is overkill).
static const unsigned char BYTE_WIDTHS[256] = {
    // 1-byte leading: 0xxxxxxx (0x00-0x7F)
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    // Trailing: 10xxxxxx (0x80-0xBF)
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    // 2-byte leading: 110xxxxx (0xC0-0xDF)
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    // 3-byte leading: 1110xxxx (0xE0-0xEF) -> 1
    // 4-byte leading: 11110xxx (0xF0-0xF7) -> 2
    // invalid:        11111xxx (0xF8-0xFF) -> 0
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,0,0,0,0,0,0,0,0
};
size_t utf8_utf16width(const unsigned char *string, size_t len)
{
    size_t i, utf16len = 0;
    for (i = 0; i < len; i++)
        utf16len += BYTE_WIDTHS[string[i]];
    return utf16len;
}
The table is 1 for the 1-byte, 2-byte, and 3-byte UTF-8 leading bytes, and 2 for the 4-byte UTF-8 leading bytes, because a 4-byte sequence encodes a code point above U+FFFF, which becomes a surrogate pair (two UTF-16 code units) when translated to an NSString.
I generated the table in Haskell (with Data.Array imported; note the array bounds are (0,255), i.e. 256 entries):
elems $ listArray (0,255) (repeat 0) //
  [(n,1) | n <- ([0x00..0x7f] ++ [0xc0..0xdf] ++ [0xe0..0xef])] //
  [(n,2) | n <- [0xf0..0xf7]]
Look at the UTF-8 encoding and note that code points begin with the following 8-bit patterns:
76543210 <- bit
0xxxxxxx <- ASCII chars
110xxxxx \
1110xxxx } <- more byte(s) (of form 10xxxxxx) follow
11110xxx /
That's what you should look for when searching for the beginning of a code point.
But that alone is only part of the solution. You need to take combining characters into account: combining diacritical marks belong together with the base character that precedes them, so you cannot just separate them and treat them as independent characters.
There's probably even more to it.
I'm not using the C format specifiers correctly. A few lines of code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main()
{
    char dest[] = "stack";
    unsigned short val = 500;
    char c = 'a';
    char* final = (char*) malloc(strlen(dest) + 6);
    snprintf(final, strlen(dest) + 6, "%c%c%hd%c%c%s", c, c, val, c, c, dest);
    printf("%s\n", final);
    return 0;
}
What I want is to copy:
final[0] = a random char
final[1] = a random char
final[2] and final[3] = the two bytes of the short
final[4] = another char, and so on.
My problem is that I want to copy the two raw bytes of the short int into 2 bytes of the final array.
Thanks.
I'm confused - the problem is that you are passing strlen(dest)+6, which limits the length of the final string to 10 chars (plus a null terminator). If you pass strlen(dest)+8, there will be enough space for the full string.
Update
Even though a short may only be 2 bytes in size, when it is printed as a string each character will take up a byte. So that means it can require up to 5 bytes of space to write a short to a string, if you are writing a number above 10000.
Now, if you write the short to the string as a hexadecimal number using the %x format specifier, it will take up no more than 4 characters (an unsigned short is at most 0xffff).
You need to allocate space for 13 characters - not 11. Don't forget the terminating '\0'.
When formatted, the number (500) takes up three characters, not one. So your snprintf should give the final length as strlen(dest)+5+3. Then also fix your malloc call to match. If you want to compute the length of the number, you could do it with a call like strlen(itoa(val)) (note that itoa is nonstandard). Also, don't forget the '\0' at the end; strlen(dest) does not count the terminator, so you need one extra byte for it.
Simple answer is that you only allocated enough space for strlen(dest) + 6 characters, when in reality you're going to need 8 extra characters beyond dest: 2 chars + 3 chars in your number + 2 chars after + the terminator, plus dest (5 chars) = 13 chars, when you allocated 11.
Unsigned shorts can take up to 5 characters, right? (0 - 65535)
Seems like you'd need to allocate 5 characters for your unsigned short to cover all of the values.
Which would point to using this:
char* final = (char*) malloc(strlen(dest) + 10);
You lose bytes because you think the short variable takes 2 bytes. But when printed it takes three: one for each digit character ('5', '0', '0'). You also need a '\0' terminator (+1 byte).
==> You need strlen(dest) + 8
Use 8 instead of 6 on:
char* final = (char*) malloc(strlen(dest) + 6);
and
snprintf(final, strlen(dest)+6, "%c%c%hd%c%c%s", c, c, val, c, c, dest);
Seems like the primary misunderstanding is that a "2-byte" short can't be represented on-screen as 2 1-byte characters.
First, leave enough room:
char* final = (char*) malloc(strlen(dest) + 9);
Not every value a 1-byte character can hold is printable. If you want to display this on screen and have it be readable, you'll have to encode the 2-byte short as 4 hex digits, such as:
// as hex, 4 characters (note: sizeof on a char* gives the pointer size, not
// the allocation, so pass the allocated size instead)
snprintf(final, strlen(dest) + 9, "%c%c%04x%c%c%s", c, c, val, c, c, dest);
If you are writing this to a file, that's OK, and you might try the following:
// print raw bytes, upper byte then lower byte
snprintf(final, strlen(dest) + 9, "%c%c%c%c%c%c%s", c, c, ((val >> 8) & 0xFF), (val & 0xFF), c, c, dest);
But that won't make sense to a human looking at it, and is sensitive to endianness. I'd strongly recommend against it.
In this code, what is the role of the symbol %3d? I know that % means it refers to a variable.
This is the code:
#include <stdio.h>

int main(void)
{
    int t, i, num[3][4];

    for(t=0; t<3; ++t)
        for(i=0; i<4; ++i)
            num[t][i] = (t*4)+i+1;

    /* now print them out */
    for(t=0; t<3; ++t) {
        for(i=0; i<4; ++i)
            printf("%3d ", num[t][i]);
        printf("\n");
    }
    return 0;
}
%3d can be broken down as follows:
% means "Print a variable here"
3 means "use at least 3 characters to display it, padding with spaces as needed"
d means "The variable will be an integer"
Putting these together, it means "Print an integer, taking minimum 3 spaces"
See http://www.cplusplus.com/reference/clibrary/cstdio/printf/ for more information
That is a format specifier to print a decimal number (d) in three (at least) digits (3).
From man printf:
An optional decimal digit string specifying a minimum field width. If the converted value has fewer characters than the field width, it will be padded with spaces on the left (or right, if the left-adjustment flag has been given) to fill out the field width.
Take a look here:
printf("%3d", X);
If X is 1234, it prints 1234.
If X is 123, it prints 123.
If X is 12, it prints _12, where _ is a single leading whitespace character.
If X is 1, it prints __1, where __ is two leading whitespace characters.
An example to enlighten existing answers:
printf("%3d" , x);
When:
x is 1234: prints 1234
x is 123: prints 123
x is 12: prints 12 with one leading padding space
x is 1: prints 1 with two leading padding spaces
You can specify the field width between the % and d(for decimal). It represents the total number of characters printed.
A positive value, as mentioned in another answer, right-aligns the output and is the default.
A negative value left-aligns the text.
example:
int a = 3;
printf("|%-3d|", a);
The output:
|3  |
You could also specify the field width as an additional parameter by using the * character:
int a = 3;
printf("|%*d|", 5, a);
which gives:
|    3|
It is a formatting specification. %3d says: print the argument as a decimal, of width 3 digits.
Literally, it means to print an integer padded with spaces to a minimum width of three characters. The % introduces a format specifier, the 3 indicates a minimum field width of 3, and the d indicates a decimal integer. Thus, the value of num[t][i] is printed to the screen as a value such as "  1", "  2", " 12", etc.
The 2, 3, or any integer between % and d is the padding/width: the minimum field width. For example, with %3d, printing a = 4 produces "  4": two spaces appear before the 4 because the value is only one character wide.