C++ substring with multi-byte UTF-8 characters - c++

I have a std::string which contains some characters that span multiple bytes.
When I take a substring of this string, the output is not valid, because these characters are of course counted as two chars. In my opinion I should be using a wstring instead, because it will store such a character as one element instead of two.
So I decided to copy the string into a wstring, but of course this does not help: the characters remain split over two elements. This only makes it worse.
Is there a good way to convert a string to a wstring, merging the multi-byte characters into one element instead of two?
Thanks

A simpler version, based on the solution provided in Getting the actual length of a UTF-8 encoded std::string? by Marcelo Cantos:
std::string substr(const std::string& originalString, int maxLength)
{
    std::string resultString = originalString;
    int len = 0;
    int byteCount = 0;
    const char* aStr = originalString.c_str();
    while (*aStr)
    {
        // Count only lead bytes; UTF-8 continuation bytes match 10xxxxxx.
        if ((*aStr & 0xc0) != 0x80)
            len += 1;
        // We just reached the lead byte of character maxLength + 1, so the
        // first byteCount bytes hold exactly maxLength characters.
        if (len > maxLength)
        {
            resultString = resultString.substr(0, byteCount);
            break;
        }
        byteCount++;
        aStr++;
    }
    return resultString;
}
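For example (a quick check of the boundary behaviour; the literal and variable name are illustrative):
// "héllo" is six bytes ('é' is two bytes, C3 A9) but five characters;
// asking for the first two characters yields the three bytes "hé".
std::string firstTwo = substr("h\xC3\xA9llo", 2);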

A std::string object is not a string of characters; it's a string of bytes. It has no notion of what's called "encoding" at all. The same goes for std::wstring, except that it's a string of wchar_t units (commonly 16-bit or 32-bit values, depending on the platform).
In order to perform operations on your text which require addressing distinct characters (as is the case when you want to take the substring, for instance) you need to know what encoding is used for your std::string object.
UPDATE: Now that you have clarified that your input string is UTF-8 encoded, you still need to decide on an encoding for your output std::wstring. UTF-16 comes to mind, but it really depends on what the APIs to which you will pass the std::wstring objects expect. Assuming that UTF-16 is acceptable, you have various choices:
On Windows, you can use the MultiByteToWideChar function; no extra dependencies required.
The UTF8-CPP library claims to provide a lightweight solution for dealing with UTF-* encoded strings. Never tried it myself, but I keep hearing good things about it.
On Linux systems, using the libiconv library is quite common.
If you need to deal with all sorts of crazy encodings and want the full-blown alpha-and-omega as far as encodings go, look at ICU.
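As a rough illustration of the first option, here is a minimal sketch of the MultiByteToWideChar route (Windows only; error handling omitted):
#include <string>
#include <windows.h>

// Convert a UTF-8 std::string to a UTF-16 std::wstring.
std::wstring utf8_to_wstring(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    // First call computes the required length in wchar_t units.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), NULL, 0);
    std::wstring result(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &result[0], len);
    return result;
}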

There are really only two possible solutions. If you're doing this a lot, over large distances, you'd be better off converting your characters to a single-element encoding, using wchar_t (or int32_t, or whatever is most appropriate). This is not a simple copy, which would convert each individual char into the target type, but a true conversion function, which would recognize the multibyte characters and convert them into a single element.
For occasional use or shorter sequences, it's possible to write your own functions for advancing n bytes. For UTF-8, I use the following:
#include <cstddef>
#include <iterator>

// Byte is a convenience typedef; byteCountTable (defined elsewhere) maps
// each of the 256 possible lead bytes to the length in bytes of its
// UTF-8 sequence.
typedef unsigned char Byte;
extern size_t const byteCountTable[ 256 ];

inline size_t
size(
    Byte ch )
{
    return byteCountTable[ ch ] ;
}

template< typename InputIterator >
InputIterator
succ(
    InputIterator begin,
    size_t size,
    std::random_access_iterator_tag )
{
    return begin + size ;
}

template< typename InputIterator >
InputIterator
succ(
    InputIterator begin,
    size_t size,
    std::input_iterator_tag )
{
    while ( size != 0 ) {
        ++ begin ;
        -- size ;
    }
    return begin ;
}

template< typename InputIterator >
InputIterator
succ(
    InputIterator begin,
    InputIterator end )
{
    if ( begin != end ) {
        // Dispatch on the iterator category; advance by the length of
        // the UTF-8 sequence that starts at *begin.
        begin = succ( begin, size( *begin ),
            typename std::iterator_traits< InputIterator >::iterator_category() ) ;
    }
    return begin ;
}

template< typename InputIterator >
size_t
characterCount(
    InputIterator begin,
    InputIterator end )
{
    size_t result = 0 ;
    while ( begin != end ) {
        ++ result ;
        begin = succ( begin, end ) ;
    }
    return result ;
}
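Hypothetical usage, assuming byteCountTable has been defined and populated (for instance from the utf8_skip_data table shown further down the page):
std::string s = "na\xC3\xAFve" ;                       // 6 bytes, 5 characters
size_t count = characterCount( s.begin(), s.end() ) ;  // yields 5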

Based on this I've written my utf8 substring function:
void utf8substr(const std::string& originalString, int SubStrLength, std::string& csSubstring)
{
    int len = 0;
    size_t byteIndex = 0;
    const char* aStr = originalString.c_str();
    size_t origSize = originalString.size();
    for (byteIndex = 0; byteIndex < origSize; byteIndex++)
    {
        // Count only lead bytes (continuation bytes match 10xxxxxx).
        if ((aStr[byteIndex] & 0xc0) != 0x80)
            len += 1;
        // Stop at the lead byte of the first character past the limit, so
        // byteIndex covers exactly SubStrLength characters.
        if (len > SubStrLength)
            break;
    }
    csSubstring = originalString.substr(0, byteIndex);
}

Unicode is hard.
1) std::wstring is not a list of code points, it's a list of wchar_t, and their width is implementation-defined (commonly 16 bits with VC++ and 32 bits with gcc and clang). Yes, that means it's useless for portable code...
2) A single character may be encoded over several code points (because of combining diacritics).
3) In some languages, two different characters together form a "unit" that is not really separable (for example, ll was traditionally considered a letter of its own in Spanish).
So... it's a bit hard.
Solving 3) may be costly (it requires specific language/usage annotations); solving 1) and 2) is absolutely necessary... and requires Unicode-aware libraries or coding your own (and probably getting it wrong).
1) is trivially solved: writing a routine transforming from UTF-8 to code points is easy (a code point can be represented with a uint32_t).
2) is more difficult: it requires a list of diacritics, and the routine must know never to cut just before a diacritic (diacritics follow the character they qualify).
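A minimal sketch of such a decoder (it assumes well-formed UTF-8; real code should validate continuation bytes and reject overlong sequences):
#include <cstdint>
#include <string>
#include <vector>

std::vector<uint32_t> decodeUtf8(const std::string& s)
{
    std::vector<uint32_t> codePoints;
    for (size_t i = 0; i < s.size(); ) {
        unsigned char lead = s[i];
        uint32_t cp;
        size_t len;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else                            { cp = lead & 0x07; len = 4; }
        // accumulate the 6 payload bits of each continuation byte
        for (size_t j = 1; j < len && i + j < s.size(); ++j)
            cp = (cp << 6) | (s[i + j] & 0x3F);
        codePoints.push_back(cp);
        i += len;
    }
    return codePoints;
}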
Otherwise, there is probably what you seek in ICU. I wish you good luck finding it.

Let me assume for simplicity that your encoding is UTF-8. In this case we would have some chars occupying more than one byte, as in your case.
Then you have std::string, where those UTF-8 encoded characters are stored.
And now you want to substr() in terms of chars, not bytes.
I'd write a function that converts a character count to a byte count. For the UTF-8 case it would look like this (the macro reads bits 7..4 of the lead byte and uses them to index 2-bit fields in the constant 0xE5000000, yielding the sequence length minus one):
#define UTF8_CHAR_LEN( byte ) ((( 0xE5000000 >> ((( byte ) >> 3 ) & 0x1e )) & 3 ) + 1)
// Note: this assumes utf8Str actually holds at least charCnt characters;
// there is no check against the end of the string.
int GetByteCountForCharCount(const char* utf8Str, int charCnt)
{
    int ByteCount = 0;
    for (int i = 0; i < charCnt; i++)
    {
        int charlen = UTF8_CHAR_LEN(*utf8Str);
        ByteCount += charlen;
        utf8Str += charlen;
    }
    return ByteCount;
}
So, say you want the substring that starts after the first 7 characters. No problem:
int pos = GetByteCountForCharCount(str.c_str(), 7);
std::string sub = str.substr(pos);

Related

How to split a string by emojis in C++

I'm trying to take a string of emojis and split it into a vector of individual emojis. Given the string:
std::string emojis = "😀🔍🦑😁🔍🎉😂🤣";
I'm trying to get:
std::vector<std::string> splitted_emojis = {"😀", "🔍", "🦑", "😁", "🔍", "🎉", "😂", "🤣"};
Edit
I've tried to do:
std::string emojis = "😀🔍🦑😁🔍🎉😂🤣";
std::vector<std::string> splitted_emojis;
size_t pos = 0;
std::string token;
while ((pos = emojis.find("")) != std::string::npos)
{
    token = emojis.substr(0, pos);
    splitted_emojis.push_back(token);
    emojis.erase(0, pos);
}
But it seems to abort with terminate called after throwing an instance of 'std::bad_alloc' after a couple of seconds.
When trying to check how many emojis are in a string using:
std::string emojis = "😀🔍🦑😁🔍🎉😂🤣";
std::cout << emojis.size() << std::endl; // returns 32
it returns a bigger number, which I assume is the underlying byte data. I don't know much about Unicode, but I'm trying to figure out where the data of an emoji begins and ends so that I can split the string into individual emojis.
I would definitely recommend that you use a library with better unicode support (all large frameworks do), but in a pinch you can get by with knowing that the UTF-8 encoding spreads Unicode characters over multiple bytes, and that the first bits of the first byte determine how many bytes a character is made up of.
I stole a function from boost. The split_by_codepoint function iterates over the input string and constructs a new string from the first N bytes (where N is determined by the byte count function), pushing it onto the ret vector.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Taken from boost internals
inline unsigned utf8_byte_count(uint8_t c)
{
    // if the most significant bit with a zero in it is in position
    // 8-N then there are N bytes in this UTF-8 sequence:
    uint8_t mask = 0x80u;
    unsigned result = 0;
    while (c & mask)
    {
        ++result;
        mask >>= 1;
    }
    return (result == 0) ? 1 : ((result > 4) ? 4 : result);
}

std::vector<std::string> split_by_codepoint(std::string input) {
    std::vector<std::string> ret;
    auto it = input.cbegin();
    while (it != input.cend()) {
        uint8_t count = utf8_byte_count(*it);
        ret.emplace_back(std::string{it, it + count});
        it += count;
    }
    return ret;
}

int main() {
    // note: u8"" literals convert to std::string up to C++17;
    // in C++20 they yield char8_t, so use a plain "" literal there
    std::string emojis = u8"😀🔍🦑😁🔍🎉😂🤣";
    auto split = split_by_codepoint(emojis);
    std::cout << split.size() << std::endl; // 8
}
Note that this function simply splits a string into UTF-8 strings containing one code point each. Determining if the character is an emoji is left as an exercise: UTF-8-decode any 4-byte characters and see if they are in the proper range.
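A rough sketch of that range check (the blocks listed are illustrative, not exhaustive; emoji are scattered across many Unicode blocks):
#include <cstdint>

// True if the decoded code point falls into a few well-known emoji blocks.
bool is_probably_emoji(uint32_t cp)
{
    return (cp >= 0x1F300 && cp <= 0x1F5FF)  // Miscellaneous Symbols and Pictographs
        || (cp >= 0x1F600 && cp <= 0x1F64F)  // Emoticons
        || (cp >= 0x1F680 && cp <= 0x1F6FF)  // Transport and Map Symbols
        || (cp >= 0x1F900 && cp <= 0x1F9FF); // Supplemental Symbols and Pictographs
}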

Lexicographical sorting for non-ascii characters

I have done lexicographical sorting for ASCII characters with the following code:
std::ifstream infile;
std::string line, new_line;
std::vector<std::string> v;
while (std::getline(infile, line))
{
    // If line is empty, ignore it
    if (line.empty())
        continue;
    new_line = line + "\n";
    // Line contains string of length > 0 then save it in vector
    if (new_line.size() > 0)
        v.push_back(new_line);
}
sort(v.begin(), v.end());
The result should be:
a
aahr
abyutrw
bb
bhehjr
cgh
cuttrew
....
But I don't know how to do lexicographical sorting for both ASCII and non-ASCII characters, in an order like this: a A À Á Ã brg Baq ckrwg CkfgF d Dgrn... Please tell me how to write code for it. Thank you!
The OP didn't mention it, but I find it worth mentioning: when speaking about non-ASCII characters, the encoding should be considered as well.
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Characters like À, Á, and à are not part of 7-bit ASCII but were included in a variety of 8-bit encodings, e.g. Windows-1252. Hence, it is not guaranteed that a character outside ASCII has the same code point (i.e. number) in every encoding, and most characters do not appear in most encodings at all.
However, Unicode provides a single table that (I believe) contains the characters of every other encoding. There are several encoding forms:
UTF-8, where code points are represented by 1 or more 8-bit units (storage with char)
UTF-16, where code points are represented by 1 or 2 16-bit units (storage with char16_t or, maybe, wchar_t)
UTF-32, where code points are represented by one 32-bit unit (storage with char32_t or, maybe, wchar_t if it has sufficient size).
Concerning the size of wchar_t: Character types.
Having said that, I used wchar_t and std::wstring in my sample to make the handling of umlauts locale- and platform-independent.
The order used by std::sort() to sort a range of T elements is defined by default by the < operator for T, i.e. bool operator<(const T&, const T&).
However, there are flavors of std::sort() to define a custom predicate instead.
The custom predicate must match the signature and must provide a strict weak ordering relation.
Hence my recommendation to use a std::map which maps the characters to an index that yields the intended order.
This is the predicate, I used in my sample:
// sort words
auto charIndex = [&mapChars](wchar_t chr)
{
    const CharMap::const_iterator iter = mapChars.find(chr);
    return iter != mapChars.end()
        ? iter->second
        : (CharMap::mapped_type)mapChars.size();
};
auto pred
    = [&mapChars, &charIndex](const std::wstring &word1, const std::wstring &word2)
{
    const size_t len = std::min(word1.size(), word2.size());
    for (size_t i = 0; i < len; ++i) {
        const wchar_t chr1 = word1[i], chr2 = word2[i];
        const unsigned i1 = charIndex(chr1), i2 = charIndex(chr2);
        if (i1 != i2) return i1 < i2;
    }
    // equal prefixes: the shorter word sorts first
    return word1.size() < word2.size();
};
std::sort(words.begin(), words.end(), pred);
From bottom to top:
std::sort(words.begin(), words.end(), pred); is called with a third parameter which provides the predicate pred for my customized order.
The lambda pred() compares two std::wstrings character by character.
Thereby, the comparison is done using a std::map mapChars which maps wchar_t to unsigned, i.e. a character to its rank in my order.
mapChars stores only a selection of all character values, so the character in question might not be found in it. To handle this, the helper lambda charIndex() returns mapChars.size() in that case, which is guaranteed to be higher than all occurring indices.
The type CharMap is simply a typedef:
typedef std::map<wchar_t, unsigned> CharMap;
To initialize a CharMap, a function is used:
CharMap makeCharMap(const wchar_t *table[], size_t size)
{
    CharMap mapChars;
    unsigned rank = 0;
    for (const wchar_t **chars = table; chars != table + size; ++chars) {
        for (const wchar_t *chr = *chars; *chr; ++chr) mapChars[*chr] = rank;
        ++rank;
    }
    return mapChars;
}
It has to be called with an array of strings which contains all groups of characters in the intended order:
const wchar_t *table[] = {
L"aA", L"äÄ", L"bB", L"cC", L"dD", L"eE", L"fF", L"gG", L"hH", L"iI", L"jJ", L"kK", L"lL", L"mM", L"nN",
L"oO", L"öÖ", L"pP", L"qQ", L"rR", L"sS", L"tT", L"uU", L"üÜ", L"vV", L"wW", L"xX", L"yY", L"zZ"
};
The complete sample:
#include <algorithm>
#include <codecvt>
#include <iostream>
#include <locale>
#include <map>
#include <sstream>
#include <string>
#include <vector>
static const wchar_t *table[] = {
L"aA", L"äÄ", L"bB", L"cC", L"dD", L"eE", L"fF", L"gG", L"hH", L"iI", L"jJ", L"kK", L"lL", L"mM", L"nN",
L"oO", L"öÖ", L"pP", L"qQ", L"rR", L"sS", L"tT", L"uU", L"üÜ", L"vV", L"wW", L"xX", L"yY", L"zZ"
};
static const wchar_t *tableGerman[] = {
L"aAäÄ", L"bB", L"cC", L"dD", L"eE", L"fF", L"gG", L"hH", L"iI", L"jJ", L"kK", L"lL", L"mM", L"nN",
L"oOöÖ", L"pP", L"qQ", L"rR", L"sS", L"tT", L"uUüÜ", L"vV", L"wW", L"xX", L"yY", L"zZ"
};
typedef std::map<wchar_t, unsigned> CharMap;
// fill a look-up table to map characters to the corresponding rank
CharMap makeCharMap(const wchar_t *table[], size_t size)
{
    CharMap mapChars;
    unsigned rank = 0;
    for (const wchar_t **chars = table; chars != table + size; ++chars) {
        for (const wchar_t *chr = *chars; *chr; ++chr) mapChars[*chr] = rank;
        ++rank;
    }
    return mapChars;
}
// conversion to UTF-8 found in https://stackoverflow.com/a/7561991/7478597
// needed to print to console
// Please, note: std::codecvt_utf8() is deprecated in C++17. :-(
std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8_conv;
// collect words and sort according to table
void printWordsSorted(
    const std::wstring &text, const wchar_t *table[], const size_t size)
{
    // make look-up table
    const CharMap mapChars = makeCharMap(table, size);
    // strip punctuation and other noise
    std::wstring textClean;
    for (const wchar_t chr : text) {
        if (chr == ' ' || mapChars.find(chr) != mapChars.end()) {
            textClean += chr;
        }
    }
    // fill word list with sample text
    std::vector<std::wstring> words;
    for (std::wistringstream in(textClean);;) {
        std::wstring word;
        if (!(in >> word)) break; // bail out
        // store word
        words.push_back(word);
    }
    // sort words
    auto charIndex = [&mapChars](wchar_t chr)
    {
        const CharMap::const_iterator iter = mapChars.find(chr);
        return iter != mapChars.end()
            ? iter->second
            : (CharMap::mapped_type)mapChars.size();
    };
    auto pred
        = [&mapChars, &charIndex](const std::wstring &word1, const std::wstring &word2)
    {
        const size_t len = std::min(word1.size(), word2.size());
        for (size_t i = 0; i < len; ++i) {
            const wchar_t chr1 = word1[i], chr2 = word2[i];
            const unsigned i1 = charIndex(chr1), i2 = charIndex(chr2);
            if (i1 != i2) return i1 < i2;
        }
        // equal prefixes: the shorter word sorts first
        return word1.size() < word2.size();
    };
    std::sort(words.begin(), words.end(), pred);
    // remove duplicates
    std::vector<std::wstring>::iterator last = std::unique(words.begin(), words.end());
    words.erase(last, words.end());
    // print result
    for (const std::wstring &word : words) {
        std::cout << utf8_conv.to_bytes(word) << '\n';
    }
}
template<typename T, size_t N>
size_t size(const T (&arr)[N]) { return sizeof arr / sizeof *arr; }

int main()
{
    // a sample string
    std::wstring sampleText
        = L"In the German language the ä (a umlaut), ö (o umlaut) and ü (u umlaut)"
          L" have the same lexicographical rank as their counterparts a, o, and u.\n";
    std::cout << "Sample text:\n"
        << utf8_conv.to_bytes(sampleText) << '\n';
    // sort like requested by OP
    std::cout << "Words of text sorted as requested by OP:\n";
    printWordsSorted(sampleText, table, size(table));
    // sort like correct in German
    std::cout << "Words of text sorted as usual in German language:\n";
    printWordsSorted(sampleText, tableGerman, size(tableGerman));
}
Output:
Words of text sorted as requested by OP:
a
and
as
ä
counterparts
German
have
In
language
lexicographical
o
ö
rank
same
the
their
u
umlaut
ü
Words of text sorted as usual in German language:
ä
a
and
as
counterparts
German
have
In
language
lexicographical
o
ö
rank
same
the
their
u
ü
umlaut
Live Demo on coliru
Note:
My original intention was to do the output with std::wcout. This didn't work correctly for ä, ö, ü. Hence, I looked up a simple way to convert wstrings to UTF-8. I already knew that UTF-8 is supported on coliru.
@Phil1970 reminded me that I forgot to mention something else:
Sorting of strings (according to “human dictionary” order) is usually provided by std::locale. std::collate provides a locale dependent lexicographical ordering of strings.
The locale plays a role because the order of characters might vary with distinct locales. The std::collate doc. has a nice example for this:
Default locale collation order: Zebra ar förnamn zebra ängel år ögrupp
English locale collation order: ängel ar år förnamn ögrupp zebra Zebra
Swedish locale collation order: ar förnamn zebra Zebra år ängel ögrupp
Conversion between UTF-16, UTF-32, and UTF-8 can be achieved by mere bit arithmetic. For conversion to/from any other encoding (excluding ASCII, which is a subset of Unicode), I would recommend a library, e.g. libiconv.
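For completeness, a minimal sketch of the std::collate route (the locale name is an assumption; it must be installed on the system):
#include <algorithm>
#include <locale>
#include <string>
#include <vector>

int main()
{
    // std::locale's operator() compares strings via the locale's
    // std::collate facet, so a locale can be passed directly to std::sort.
    std::locale swedish("sv_SE.UTF-8");
    std::vector<std::wstring> words = { L"zebra", L"ängel", L"år", L"ar" };
    std::sort(words.begin(), words.end(), swedish);
}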

utf8 aware strncpy

I find it hard to believe I'm the first person to run into this problem but searched for quite some time and didn't find a solution to this.
I'd like to use strncpy but have it be UTF8 aware so it doesn't partially write a utf8 character into the destination string.
Otherwise you can never be sure that the resulting string is valid UTF8, even if you know the source is (when the source string is larger than the max length).
Validating the resulting string can work but if this is to be called a lot it would be better to have a strncpy function that checks for it.
glib has g_utf8_strncpy, but this copies a certain number of Unicode characters, whereas I'm looking for a copy function that limits by byte length.
To be clear, by "utf8 aware" I mean that it should not exceed the limit of the destination buffer and must never copy only part of a UTF-8 character. (Given valid UTF-8 input, it must never produce invalid UTF-8 output.)
Note:
Some replies have pointed out that strncpy nulls all bytes and that it won't ensure zero termination; in retrospect I should have asked for a utf8-aware strlcpy, but at the time I didn't know that this function existed.
I've tested this on many sample UTF8 strings with multi-byte characters. If the source is too long, it does a reverse search of it (starts at the null terminator) and works backward to find the last full UTF8 character which can fit into the destination buffer. It always ensures the destination is null terminated.
char* utf8cpy(char* dst, const char* src, size_t sizeDest)
{
    if (sizeDest) {
        size_t sizeSrc = strlen(src); // number of bytes not including null
        while (sizeSrc >= sizeDest) {
            const char* lastByte = src + sizeSrc; // initially, pointing to the null terminator
            while (lastByte-- > src)
                if ((*lastByte & 0xC0) != 0x80) // found the initial byte of the (potentially) multi-byte character (or found null)
                    break;
            sizeSrc = lastByte - src;
        }
        memcpy(dst, src, sizeSrc);
        dst[sizeSrc] = '\0';
    }
    return dst;
}
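A quick usage check of the cut-at-boundary behaviour (buffer size and string are illustrative):
#include <cstdio>

int main()
{
    // "été" is five bytes: C3 A9 74 C3 A9. A four-byte buffer holds at
    // most three payload bytes, so the copy stops after "ét" instead of
    // writing half of the trailing 'é'.
    char small[4];
    utf8cpy(small, "\xC3\xA9t\xC3\xA9", sizeof small);
    std::printf("%s\n", small); // prints "ét"
}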
I'm not sure what you mean by UTF-8 aware; strncpy copies bytes, not characters, and the size of the buffer is given in bytes as well. If what you mean is that it will only copy complete UTF-8 characters, stopping, for example, if there isn't room for the next character, I'm not aware of such a function, but it shouldn't be too hard to write:
int
utf8Size( char ch )
{
    static int const sizeTable[] =
    {
        // ...
    };
    return sizeTable[ static_cast<unsigned char>( ch ) ] ;
}

// IllegalUTF8 is an exception type to be defined by the caller.
// The case fall-through in the switch is intentional.
char*
stru8ncpy( char* dest, char* source, int n )
{
    while ( *source != '\0' && utf8Size( *source ) < n ) {
        n -= utf8Size( *source ) ;
        switch ( utf8Size( *source ) ) {
        case 6:
            *dest ++ = *source ++ ;
        case 5:
            *dest ++ = *source ++ ;
        case 4:
            *dest ++ = *source ++ ;
        case 3:
            *dest ++ = *source ++ ;
        case 2:
            *dest ++ = *source ++ ;
        case 1:
            *dest ++ = *source ++ ;
            break ;
        default:
            throw IllegalUTF8() ;
        }
    }
    *dest = '\0' ;
    return dest ;
}
(The contents of the table in utf8Size are a bit painful to generate, but this is a function you'll be using a lot if you're dealing with UTF-8, and you only have to do it once.)
To reply to my own question, here's the C function I ended up with (not using C++ for this project):
Notes:
- Realize this is not a clone of strncpy for utf8; it's more like strlcpy from OpenBSD.
- utf8_skip_data is copied from glib's gutf8.c.
- It doesn't validate the utf8, which is what I intended.
Hope this is useful to others; I'm interested in feedback, but please no pedantic zealotry about NULL termination behavior unless it's an actual bug or misleading/incorrect behavior.
Thanks to James Kanze, whose answer provided the basis for this but was incomplete and in C++ (I need a C version).
static const size_t utf8_skip_data[256] = {
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,6,6,1,1
};
char *strlcpy_utf8(char *dst, const char *src, size_t maxncpy)
{
    char *dst_r = dst;
    size_t utf8_size;
    if (maxncpy > 0) {
        while (*src != '\0' &&
               (utf8_size = utf8_skip_data[*((unsigned char *)src)]) < maxncpy)
        {
            maxncpy -= utf8_size;
            switch (utf8_size) { /* fall-through is intentional */
            case 6: *dst ++ = *src ++;
            case 5: *dst ++ = *src ++;
            case 4: *dst ++ = *src ++;
            case 3: *dst ++ = *src ++;
            case 2: *dst ++ = *src ++;
            case 1: *dst ++ = *src ++;
            }
        }
        *dst = '\0';
    }
    return dst_r;
}
strncpy() is a terrible function:
If there is insufficient space, the resulting string will not be nul terminated.
If there is enough space, the remainder is filled with NULs. This can be painful if the target string is very big.
Even if the characters stay in the ASCII range (0x7f and below), the resulting string will not be what you want. In the UTF-8 case it might not be nul-terminated, and it might end in an invalid UTF-8 sequence.
Best advice is to avoid strncpy().
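For plain truncating copies, a common safer pattern is snprintf, which (for a non-zero size) always nul-terminates and reports truncation through its return value. A minimal sketch; note it can still cut in the middle of a UTF-8 sequence, so the utf8-aware variants below are still needed for that:
#include <stdio.h>

int main(void)
{
    char buf[8];
    /* snprintf writes at most sizeof buf - 1 bytes plus a terminator and
       returns the length the full output would have had. */
    int needed = snprintf(buf, sizeof buf, "%s", "hello world!");
    if (needed >= (int) sizeof buf) {
        /* the output was truncated */
    }
    printf("%s\n", buf); /* prints "hello w" */
    return 0;
}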
EDIT:
ad 1):
#include <stdio.h>
#include <string.h>

int main (void)
{
    char buff [4];
    /* buff receives "hell" with no terminator; printf then reads past
       the end of the buffer (undefined behavior). */
    strncpy (buff, "hello world!\n", sizeof buff );
    printf("%s\n", buff );
    return 0;
}
Agreed, the buffer will not be overrun. But the result is still unwanted. strncpy() solves only part of the problem. It is misleading, and it is best avoided.
UPDATE (2012-10-31): Since this is a nasty problem, I decided to hack my own version, mimicking the ugly strncpy() behavior. The return value is the number of bytes copied, though...
#include <stdio.h>
#include <string.h>

size_t utf8ncpy(char *dst, char *src, size_t todo);
static int cnt_utf8(unsigned ch, size_t len);

static int cnt_utf8(unsigned ch, size_t len)
{
    if (!len) return 0;
    if ((ch & 0x80) == 0x00) return 1;
    else if ((ch & 0xe0) == 0xc0) return 2;
    else if ((ch & 0xf0) == 0xe0) return 3;
    else if ((ch & 0xf8) == 0xf0) return 4;
    else if ((ch & 0xfc) == 0xf8) return 5;
    else if ((ch & 0xfe) == 0xfc) return 6;
    else return -1; /* Default (Not in the spec) */
}

size_t utf8ncpy(char *dst, char *src, size_t todo)
{
    size_t done, idx, chunk, srclen;

    srclen = strlen(src);
    for (done = idx = 0; idx < srclen; idx += chunk) {
        int ret = 0;
        for (chunk = 0; done + chunk < todo; chunk++) {
            ret = cnt_utf8(src[idx + chunk], srclen - (idx + chunk));
            if (ret == 1) continue; /* Normal character: collect it into chunk */
            if (ret < 0) continue;  /* Bad stuff: treat as normal char */
            if (ret == 0) break;    /* EOF */
            if (!chunk) chunk = ret;/* a UTF8 multibyte character */
            else ret = 1;           /* we already collected a number (chunk) of normal characters */
            break;
        }
        if (ret > 1 && done + chunk > todo) break;
        if (done + chunk > todo) chunk = todo - done;
        if (!chunk) break;
        memcpy(dst + done, src + idx, chunk);
        done += chunk;
        if (ret < 1) break;
    }
    /* This is part of the dreaded strncpy() behavior:
    ** pad the destination string with NULs
    ** up to its intended size
    */
    if (done < todo) memset(dst + done, 0, todo - done);
    return done;
}

int main(void)
{
    char *string = "Hell\xc3\xb6 \xf1\x82\x82\x82, world\xc2\xa1!";
    char buffer[30];
    unsigned result, len;

    /* len is unsigned, so it wraps around below zero and ends the loop */
    for (len = sizeof buffer - 1; len < sizeof buffer; len -= 3) {
        result = utf8ncpy(buffer, string, len);
        /* remove the following line to get the REAL strncpy() behaviour */
        buffer[result] = 0;
        printf("Chop #%u\n", len);
        printf("Org:[%s]\n", string);
        printf("Res:%u\n", result);
        printf("New:[%s]\n", buffer);
    }
    return 0;
}
Here is a C++ solution:
u8string.h:
#ifndef U8STRING_H
#define U8STRING_H 1
#include <stddef.h>
#ifdef __cplusplus
extern "C" {
#endif
/**
* Copies the first few characters of the UTF-8-encoded string pointed to by
* \p src into \p dest_buf, as many UTF-8-encoded characters as can be written in
* <code>dest_buf_len - 1</code> bytes or until the NUL terminator of the string
* pointed to by \p src is reached.
*
* The string of bytes that are written into \p dest_buf is NUL terminated
* if \p dest_buf_len is greater than 0.
*
* \returns \p dest_buf
*/
char * u8slbcpy(char *dest_buf, const char *src, size_t dest_buf_len);
#ifdef __cplusplus
}
#endif
#endif
u8slbcpy.cpp:
#include "u8string.h"
#include <cstring>
#include <utf8.h>
char * u8slbcpy(char *dest_buf, const char *src, size_t dest_buf_len)
{
if (dest_buf_len <= 0) {
return dest_buf;
} else if (dest_buf_len == 1) {
dest_buf[0] = '\0';
return dest_buf;
}
size_t num_bytes_remaining = dest_buf_len - 1;
utf8::unchecked::iterator<const char *> it(src);
const char * prev_base = src;
while (*it++ != '\0') {
const char *base = it.base();
ptrdiff_t diff = (base - prev_base);
if (num_bytes_remaining < diff) {
break;
}
num_bytes_remaining -= diff;
prev_base = base;
}
size_t n = dest_buf_len - 1 - num_bytes_remaining;
std::memmove(dest_buf, src, n);
dest_buf[n] = '\0';
return dest_buf;
}
The function u8slbcpy() has a C interface, but it is implemented in C++. My implementation uses the header-only UTF8-CPP library.
I think that this is pretty much what you are looking for, but note that there is still the problem that one or more combining characters might not be copied if the combining characters apply to the nth character (itself not a combining character) and the destination buffer is just large enough to store the UTF-8 encoding of characters 1 through n, but not the combining characters of character n. In this case, the bytes representing characters 1 through n are written, but none of the combining characters of n are. In effect, you could say that the nth character is partially written.
To comment on the answer above, "strncpy() is a terrible function:".
I hate to even comment on such blanket statements at the expense of creating yet another internet programming jihad, but will anyhow, since statements like this are misleading to those who might come here looking for answers.
Okay, maybe C string functions are "old school". Maybe all strings in C/C++ should be in some kind of smart container, etc.; maybe one should use C++ instead of C (when you have a choice). These are more matters of preference and arguments for other topics.
I came here looking for a UTF-8 strncpy() myself. Not that I couldn't make one (the encoding is IMHO simple and elegant), but I wanted to see how others made theirs and perhaps find an ASM-optimized one.
To the "god's gift to programming" people of the world: put your hubris aside for a moment and look at some facts.
There is nothing wrong with strncpy() or any of the similar functions with the same side effects and issues, like _snprintf(), etc.
I say: "strncpy() is not terrible"; rather, "terrible programmers use it terribly".
What is "terrible" is not knowing the rules.
Furthermore, because of the security (buffer overrun) and program stability implications of the whole subject, there wouldn't be a need for, e.g., Microsoft to add "Safe String Functions" to its CRT lib if the rules were just followed.
The main ones:
"sizeof()" returns the length of a static string w/terminator.
"strlen()" returns the length of string w/o terminator.
Most if no all "n" functions just clamp to 'n' with out adding a terminator.
There is implicit ambiguity on what "buffer size" is in functions that require and input buffer size. I.E. The "(char *pszBuffer, int iBufferSize)" types.
Safer to assume the worst and pass a size one less then the actual buffer size, and adding a terminator at the end to be sure.
For string inputs, buffers, etc., set and use a reasonable size limit based on expected average and maximum. To hopefully avoid input truncation, and to eliminate buffer overruns period.
This is how I personally handle such things, and other rules that are just to be known and practiced.
A handy macro for static string size:
// Size of a string with out terminator
#define SIZESTR(x) (sizeof(x) - 1)
When declaring local/stack string buffers:
A) The size is, for example, limited to 1023 + 1 for the terminator, to allow strings up to 1023 chars in length.
B) I initialize the string to zero length, and also terminate at the very end to cover a possible 'n' truncation.
char szBuffer[1024]; szBuffer[0] = szBuffer[SIZESTR(szBuffer)] = 0;
Alternately one could do just:
char szBuffer[1024] = {0};
of course, but then there is a performance implication in the compiler-generated memset()-like call that zeros the whole buffer. It makes things cleaner for debugging though, and I prefer this style for static (vs. local/stack) string buffers.
Now a "strncpy()" following the rules:
char szBuffer[1024]; szBuffer[0] = szBuffer[SIZESTR(szBuffer)] = 0;
strncpy(szBuffer, pszSomeInput, SIZESTR(szBuffer));
There are other "rules" and issues of course, but these are the main ones that come to mind.
You just have to know how the lib functions work and use safe practices like this.
Finally, in my project I use ICU anyhow, so I decided to go with it and use the macros in "utf8.h" to make my own "strncpy()".

Trimming UTF8 buffer

I have a buffer with UTF8 data. I need to remove the leading and trailing spaces.
Here is the C code which does it (in place) for an ASCII buffer:
char *trim(char *s)
{
    while (isspace(*s))
        memmove(s, s + 1, strlen(s));
    while (*s && isspace(s[strlen(s) - 1]))
        s[strlen(s) - 1] = 0;
    return s;
}
How do I do the same for a UTF-8 buffer in C/C++?
P.S.
Thanks for the performance tip regarding strlen(). Back to the UTF-8 specifics: what if I need to remove all spaces altogether, not only at the beginning and the tail? Also, I may need to remove all characters with ASCII code < 32. Is there anything specific to the UTF-8 case here, like using mbstowcs()?
Do you want to remove all of the various Unicode spaces too, or just ASCII spaces? In the latter case you don't need to modify the code at all.
In any case, the method you're using that repeatedly calls strlen is extremely inefficient. It turns a simple O(n) operation into at least O(n^2).
Edit: Here's some code for your updated problem, assuming you only want to strip ASCII spaces and control characters:
unsigned char *in, *out;
/* In UTF-8 every byte of a multi-byte character is >= 0x80, so a
   byte-wise filter on values <= 32 never splits a character. */
for (out = in; *in; in++) if (*in > 32) *out++ = *in;
*out = 0;
strlen() scans to the end of the string, so calling it multiple times, as in your code, is very inefficient.
Try looking for the first non-space and the last non-space and then memmove the substring:
char *trim(char *s)
{
    char *first;
    char *last;

    first = s;
    /* cast to unsigned char: passing a negative char (e.g. a UTF-8
       continuation byte) to isspace() is undefined behavior */
    while (isspace((unsigned char)*first))
        ++first;
    last = first + strlen(first) - 1;
    while (last > first && isspace((unsigned char)*last))
        --last;
    memmove(s, first, last - first + 1);
    s[last - first + 1] = '\0';
    return s;
}
Also remember that the code modifies its argument.

Convert wchar_t to char

I was wondering, is it safe to do so?
wchar_t wide = /* something */;
assert(wide >= 0 && wide < 256 &&);
char myChar = static_cast<char>(wide);
If I am pretty sure the wide char will fall within ASCII range.
Why not just use the library routine wcstombs?
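A sketch of that route using the related wcsrtombs (which, unlike wcstombs, is specified by ISO C to accept a null destination for a dry-run length query); it converts to the multibyte encoding of the current C locale:
#include <cwchar>
#include <string>

std::string narrow(const wchar_t* wide)
{
    std::mbstate_t state = std::mbstate_t();
    const wchar_t* src = wide;
    // dry run: how many bytes will the multibyte form need?
    std::size_t len = std::wcsrtombs(NULL, &src, 0, &state);
    if (len == (std::size_t)-1) return std::string(); // unconvertible character
    std::string out(len, '\0');
    src = wide;
    state = std::mbstate_t();
    std::wcsrtombs(&out[0], &src, len, &state);
    return out;
}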
assert is for ensuring that something is true in a debug mode, without it having any effect in a release build. Better to use an if statement and have an alternate plan for characters that are outside the range, unless the only way to get characters outside the range is through a program bug.
Also, depending on your character encoding, you might find a difference between the Unicode characters 0x80 through 0xff and their char version.
You are looking for wctomb(): it's in the ANSI standard, so you can count on it. It works even when the wchar_t uses a code above 255. You almost certainly do not want to use it.
wchar_t is an integral type, so your compiler won't complain if you actually do:
char x = (char)wc;
but because it's an integral type, there's absolutely no reason to do this. If you accidentally read Herbert Schildt's C: The Complete Reference, or any C book based on it, then you're completely and grossly misinformed. Characters should be of type int or better. That means you should be writing this:
int x = getchar();
and not this:
char x = getchar(); /* <- WRONG! */
As far as integral types go, char is worthless. You shouldn't make functions that take parameters of type char, and you should not create temporary variables of type char, and the same advice goes for wchar_t as well.
char* may be a convenient typedef for a character string, but it is a novice mistake to think of this as an "array of characters" or a "pointer to an array of characters" - despite what the cdecl tool says. Treating it as an actual array of characters with nonsense like this:
for(int i = 0; s[i]; ++i) {
wchar_t wc = s[i];
char c = doit(wc);
out[i] = c;
}
is absurdly wrong. It will not do what you want; it will break in subtle and serious ways, behave differently on different platforms, and you will most certainly confuse the hell out of your users. If you see this, you are trying to reimplement wcstombs(), which is part of ANSI C already, but it's still wrong.
You're really looking for iconv(), which converts a character string from one encoding (even if it's packed into a wchar_t array), into a character string of another encoding.
Now go read this, to learn what's wrong with iconv.
An easy way is :
wstring your_wchar_in_ws(<your wchar>);
string your_wchar_in_str(your_wchar_in_ws.begin(), your_wchar_in_ws.end());
const char* your_wchar_in_char = your_wchar_in_str.c_str();
I've been using this method for years :)
A short function I wrote a while back to pack a wchar_t array into a char array. Characters outside the ASCII range (0-127) are replaced by '?' characters, and it handles surrogate pairs correctly.
size_t to_narrow(const wchar_t * src, char * dest, size_t dest_len){
    size_t i = 0; // index into src
    size_t j = 0; // index into dest (lags behind i at surrogate pairs)
    wchar_t code;
    while (src[i] != '\0' && j < dest_len - 1){
        code = src[i];
        if (code < 128)
            dest[j] = char(code);
        else{
            dest[j] = '?';
            if (code >= 0xD800 && code <= 0xDBFF)
                // lead surrogate, skip the next code unit, which is the trail
                i++;
        }
        i++;
        j++;
    }
    dest[j] = '\0';
    return j; // number of characters written, excluding the terminator
}
Technically, 'char' could have the same range as either 'signed char' or 'unsigned char'. For the unsigned characters, your range is correct; theoretically, for signed characters, your condition is wrong. In practice, very few compilers will object - and the result will be the same.
Nitpick: the last && in the assert is a syntax error.
Whether the assertion is appropriate depends on whether you can afford to crash when the code gets to the customer, and what you could or should do if the assertion condition is violated but the assertion is not compiled into the code. For debug work, it seems fine, but you might want an active test after it for run-time checking too.
Here's another way of doing it, remember to use free() on the result.
char* wchar_to_char(const wchar_t* pwchar)
{
    // get the number of characters in the string
    int currentCharIndex = 0;
    wchar_t currentChar = pwchar[currentCharIndex];
    while (currentChar != '\0')
    {
        currentCharIndex++;
        currentChar = pwchar[currentCharIndex];
    }
    const int charCount = currentCharIndex + 1; // including the terminator
    // allocate a new block of chars (1 byte each) instead of wide chars
    // (typically 2 or 4 bytes each)
    char* filePathC = (char*)malloc(sizeof(char) * charCount);
    for (int i = 0; i < charCount; i++)
    {
        // truncate to char (1 byte); the loop also copies the terminator
        char character = pwchar[i];
        *filePathC = character;
        filePathC += 1;
    }
    // rewind the pointer to the start of the buffer
    filePathC -= charCount;
    return filePathC;
}
one could also convert wchar_t --> wstring --> string --> char
wchar_t wide;
wstring wstrValue(1, wide); // a wstring holding the single wide character
string strValue;
strValue.assign(wstrValue.begin(), wstrValue.end()); // convert wstring to string
char char_value = strValue[0];
In general, no. int(wchar_t(255)) == int(char(255)) of course, but that just means they have the same int value. They may not represent the same characters.
You would see such a discrepancy on the majority of Windows PCs, even. For instance, on Windows code page 1250, char(0xFF) is the same character as wchar_t(0x02D9) (dot above), not wchar_t(0x00FF) (small y with diaeresis).
Note that it does not even hold for the ASCII range, as C++ doesn't even require ASCII. On IBM systems in particular (which use EBCDIC), you may see that 'A' != 65.