Lexicographical sorting for non-ASCII characters - C++

I have done lexicographical sorting for ASCII characters with the following code:
std::ifstream infile;
std::string line, new_line;
std::vector<std::string> v;
while(std::getline(infile, line))
{
// If line is empty, ignore it
if(line.empty())
continue;
new_line = line + "\n";
// Line contains string of length > 0 then save it in vector
if(new_line.size() > 0)
v.push_back(new_line);
}
sort(v.begin(), v.end());
The result should be:
a
aahr
abyutrw
bb
bhehjr
cgh
cuttrew
....
But I don't know how to do lexicographical sorting for both ASCII and non-ASCII characters so that the order looks like this: a A À Á Ã brg Baq ckrwg CkfgF d Dgrn... Please tell me how to write code for it. Thank you!

The OP didn't, but I find it worth mentioning: when speaking about non-ASCII characters, the encoding has to be considered as well.
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Characters like À, Á, and Ã are not part of 7-bit ASCII but appear in a variety of 8-bit encodings such as Windows-1252. Hence, it's not guaranteed that a character outside of ASCII has the same code point (i.e. number) in every encoding, and most characters have no code point at all in most encodings.
Unicode, however, provides a single table of code points that (I believe) covers the characters of every other encoding. There are several encoding forms:
UTF-8, where a code point is represented by 1 to 4 8-bit units (storage with char)
UTF-16, where a code point is represented by 1 or 2 16-bit units (storage with std::char16_t or, maybe, wchar_t)
UTF-32, where a code point is represented by one 32-bit unit (storage with std::char32_t or, maybe, wchar_t if it has sufficient size).
Concerning the size of wchar_t: Character types.
Having said that, I used wchar_t and std::wstring in my sample to make the handling of umlauts locale- and platform-independent.
By default, std::sort() orders a range of T elements using the < operator for T (i.e. bool operator<(const T&, const T&)).
However, there are flavors of std::sort() that accept a custom predicate instead.
The custom predicate must match the expected signature and must provide a strict weak ordering.
Hence my recommendation: use a std::map which maps the characters to a rank that yields the intended order.
This is the predicate, I used in my sample:
// sort words
auto charIndex = [&mapChars](wchar_t chr)
{
const CharMap::const_iterator iter = mapChars.find(chr);
return iter != mapChars.end()
? iter->second
: (CharMap::mapped_type)mapChars.size();
};
auto pred
= [&charIndex](const std::wstring &word1, const std::wstring &word2)
{
const size_t len = std::min(word1.size(), word2.size());
// compare up to the length of the shorter word
for (size_t i = 0; i < len; ++i) {
const wchar_t chr1 = word1[i], chr2 = word2[i];
const unsigned i1 = charIndex(chr1), i2 = charIndex(chr2);
if (i1 != i2) return i1 < i2;
}
return word1.size() < word2.size();
};
std::sort(words.begin(), words.end(), pred);
From bottom to top:
std::sort(words.begin(), words.end(), pred); is called with a third parameter which provides the predicate pred for my customized order.
The lambda pred() compares two std::wstrings character by character.
The comparison uses a std::map mapChars which maps wchar_t to unsigned, i.e. a character to its rank in my order.
mapChars stores only a selection of all character values, so the character in question might not be found in it. To handle this, the helper lambda charIndex() returns mapChars.size() in that case, which is guaranteed to be higher than all ranks stored in the map.
The type CharMap is simply a typedef:
typedef std::map<wchar_t, unsigned> CharMap;
To initialize a CharMap, a function is used:
CharMap makeCharMap(const wchar_t *table[], size_t size)
{
CharMap mapChars;
unsigned rank = 0;
for (const wchar_t **chars = table; chars != table + size; ++chars) {
for (const wchar_t *chr = *chars; *chr; ++chr) mapChars[*chr] = rank;
++rank;
}
return mapChars;
}
It has to be called with an array of strings which contains all groups of characters in the intended order:
const wchar_t *table[] = {
L"aA", L"äÄ", L"bB", L"cC", L"dD", L"eE", L"fF", L"gG", L"hH", L"iI", L"jJ", L"kK", L"lL", L"mM", L"nN",
L"oO", L"öÖ", L"pP", L"qQ", L"rR", L"sS", L"tT", L"uU", L"üÜ", L"vV", L"wW", L"xX", L"yY", L"zZ"
};
The complete sample:
#include <algorithm>
#include <codecvt>
#include <iostream>
#include <locale>
#include <map>
#include <sstream>
#include <string>
#include <vector>
static const wchar_t *table[] = {
L"aA", L"äÄ", L"bB", L"cC", L"dD", L"eE", L"fF", L"gG", L"hH", L"iI", L"jJ", L"kK", L"lL", L"mM", L"nN",
L"oO", L"öÖ", L"pP", L"qQ", L"rR", L"sS", L"tT", L"uU", L"üÜ", L"vV", L"wW", L"xX", L"yY", L"zZ"
};
static const wchar_t *tableGerman[] = {
L"aAäÄ", L"bB", L"cC", L"dD", L"eE", L"fF", L"gG", L"hH", L"iI", L"jJ", L"kK", L"lL", L"mM", L"nN",
L"oOöÖ", L"pP", L"qQ", L"rR", L"sS", L"tT", L"uUüÜ", L"vV", L"wW", L"xX", L"yY", L"zZ"
};
typedef std::map<wchar_t, unsigned> CharMap;
// fill a look-up table to map characters to the corresponding rank
CharMap makeCharMap(const wchar_t *table[], size_t size)
{
CharMap mapChars;
unsigned rank = 0;
for (const wchar_t **chars = table; chars != table + size; ++chars) {
for (const wchar_t *chr = *chars; *chr; ++chr) mapChars[*chr] = rank;
++rank;
}
return mapChars;
}
// conversion to UTF-8 found in https://stackoverflow.com/a/7561991/7478597
// needed to print to console
// Please, note: std::codecvt_utf8() is deprecated in C++17. :-(
std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8_conv;
// collect words and sort according to table
void printWordsSorted(
const std::wstring &text, const wchar_t *table[], const size_t size)
{
// make look-up table
const CharMap mapChars = makeCharMap(table, size);
// strip punctuation and other noise
std::wstring textClean;
for (const wchar_t chr : text) {
if (chr == ' ' || mapChars.find(chr) != mapChars.end()) {
textClean += chr;
}
}
// fill word list with sample text
std::vector<std::wstring> words;
for (std::wistringstream in(textClean);;) {
std::wstring word;
if (!(in >> word)) break; // bail out
// store word
words.push_back(word);
}
// sort words
auto charIndex = [&mapChars](wchar_t chr)
{
const CharMap::const_iterator iter = mapChars.find(chr);
return iter != mapChars.end()
? iter->second
: (CharMap::mapped_type)mapChars.size();
};
auto pred
= [&charIndex](const std::wstring &word1, const std::wstring &word2)
{
const size_t len = std::min(word1.size(), word2.size());
// compare up to the length of the shorter word
for (size_t i = 0; i < len; ++i) {
const wchar_t chr1 = word1[i], chr2 = word2[i];
const unsigned i1 = charIndex(chr1), i2 = charIndex(chr2);
if (i1 != i2) return i1 < i2;
}
return word1.size() < word2.size();
};
std::sort(words.begin(), words.end(), pred);
// remove duplicates
std::vector<std::wstring>::iterator last = std::unique(words.begin(), words.end());
words.erase(last, words.end());
// print result
for (const std::wstring &word : words) {
std::cout << utf8_conv.to_bytes(word) << '\n';
}
}
template<typename T, size_t N>
size_t size(const T (&arr)[N]) { return sizeof arr / sizeof *arr; }
int main()
{
// a sample string
std::wstring sampleText
= L"In the German language the ä (a umlaut), ö (o umlaut) and ü (u umlaut)"
L" have the same lexicographical rank as their counterparts a, o, and u.\n";
std::cout << "Sample text:\n"
<< utf8_conv.to_bytes(sampleText) << '\n';
// sort like requested by OP
std::cout << "Words of text sorted as requested by OP:\n";
printWordsSorted(sampleText, table, size(table));
// sort like correct in German
std::cout << "Words of text sorted as usual in German language:\n";
printWordsSorted(sampleText, tableGerman, size(tableGerman));
}
Output:
Words of text sorted as requested by OP:
a
and
as
ä
counterparts
German
have
In
language
lexicographical
o
ö
rank
same
the
their
u
umlaut
ü
Words of text sorted as usual in German language:
ä
a
and
as
counterparts
German
have
In
language
lexicographical
o
ö
rank
same
the
their
u
ü
umlaut
Live Demo on coliru
Note:
My original intention was to do the output with std::wcout. This didn't work correctly for ä, ö, ü. Hence, I looked up a simple way to convert wstrings to UTF-8. I already knew that UTF-8 is supported in coliru.
@Phil1970 reminded me that I forgot to mention something else:
Sorting of strings (according to “human dictionary” order) is usually provided by std::locale. std::collate provides a locale dependent lexicographical ordering of strings.
The locale plays a role because the order of characters might vary with distinct locales. The std::collate doc. has a nice example for this:
Default locale collation order: Zebra ar förnamn zebra ängel år ögrupp
English locale collation order: ängel ar år förnamn ögrupp zebra Zebra
Swedish locale collation order: ar förnamn zebra Zebra år ängel ögrupp
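For completeness, a minimal sketch of such locale-aware sorting. The locale name "sv_SE.UTF-8" is an assumption and must exist on your system, and the wide-character output is subject to the same terminal caveats mentioned in the note above:
#include <algorithm>
#include <iostream>
#include <locale>
#include <string>
#include <vector>

int main()
{
    std::vector<std::wstring> words = { L"Zebra", L"ängel", L"år", L"förnamn", L"zebra", L"ögrupp", L"ar" };
    // std::locale has an operator() that compares strings via its std::collate facet,
    // so it can be passed directly as the predicate to std::sort().
    std::sort(words.begin(), words.end(), std::locale("sv_SE.UTF-8"));
    std::wcout.imbue(std::locale("sv_SE.UTF-8"));
    for (const std::wstring &word : words) std::wcout << word << L'\n';
}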
Conversion of UTF-16 ⇔ UTF-32 ⇔ UTF-8 can be achieved by mere bit arithmetic. For conversion to/from any other encoding (excluding ASCII, which is a subset of Unicode), I would recommend a library such as libiconv.
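For illustration, a minimal sketch of the UTF-32 → UTF-8 direction of that bit arithmetic (validation of the code point is omitted):
#include <string>

// Encode a single Unicode code point (UTF-32 value) as UTF-8 bytes.
std::string encodeUtf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}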

Related

How to get the name of a Unicode character?

I think I saw this a long time ago; a way to get a string containing the name of a Unicode character by using Win32 API calls. I'm using C++ Builder, so if there is support for it in the VCL library that would work fine too.
For example:
GetUnicodeName(U+0021) would return a string (or fill in a struct or similar), such as "EXCLAMATION MARK".
Or if there are some other way to get the same result from Windows with C or C++.
The worst case scenario would be to have a HUGE lookup table with the names of interest (mainly Latin characters).
You can use undocumented GetUName method from getuname.dll:
std::string GetUnicodeCharacterName(wchar_t character)
{
// https://github.com/reactos/reactos/tree/master/dll/win32/getuname
typedef int(WINAPI* GetUNameFunc)(WORD wCharCode, LPWSTR lpBuf);
static GetUNameFunc pfnGetUName = reinterpret_cast<GetUNameFunc>(::GetProcAddress(::LoadLibraryA("getuname.dll"), "GetUName"));
if (!pfnGetUName)
return {};
std::array<WCHAR, 256> buffer;
int length = pfnGetUName(character, buffer.data());
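// utf8::narrow() is the answer's own UTF-16 -> UTF-8 helper; it is not part of the standard library and is not shown here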
return utf8::narrow(buffer.data(), length);
}
// Replace invisible code point with code point that is visible
wchar_t ReplaceInvisible(wchar_t character)
{
if (!std::iswgraph(character))
{
if (character <= 0x21)
character += 0x2400; // U+2400 Control Pictures https://www.unicode.org/charts/PDF/U2400.pdf
else
character = 0xFFFD; // REPLACEMENT CHARACTER
}
return character;
}
// Accepts in UTF-8.
// Returns UTF-8 string like this:
// q <U+71 Latin Small Letter Q>
// п <U+43F Cyrillic Small Letter Pe>
// ␈ <U+8 Backspace>
// 𐌸 <U+10338 Supplementary Multilingual Plane>
// 🚒 <U+1F692 Supplementary Multilingual Plane>
std::string GetUnicodeCharacterNames(std::string string)
{
// UTF-8 <=> UTF-32 converter
std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> utf32conv;
// UTF-8 to UTF-32
std::u32string utf32string = utf32conv.from_bytes(string);
std::string characterNames;
characterNames.reserve(35 * utf32string.size());
for (const char32_t& codePoint : utf32string)
{
if (!characterNames.empty())
characterNames.append(", ");
char32_t visibleCodePoint = (codePoint < 0xFFFF) ? ReplaceInvisible(static_cast<wchar_t>(codePoint)) : codePoint;
std::string charName = (codePoint < 0xFFFF) ? GetUnicodeCharacterName(static_cast<wchar_t>(codePoint)) : "Supplementary Multilingual Plane";
// UTF-32 to UTF-8
std::string utf8codePoint = utf32conv.to_bytes(&visibleCodePoint, &visibleCodePoint + 1);
characterNames.append(fmt::format("{} <U+{:X} {}>", utf8codePoint, static_cast<uint32_t>(codePoint), charName));
}
return characterNames;
}
The downside is that it only contains characters from the Unicode Basic Multilingual Plane (BMP).
Update: You can use u_charName() ICU API that comes with Windows since Fall Creators Update (Version 1709 Build 16299):
std::string GetUCharNameWrapper(char32_t codePoint)
{
typedef int32_t(*u_charNameFunc)(char32_t code, int nameChoice, char* buffer, int32_t bufferLength, int* pErrorCode);
static u_charNameFunc pfnU_charName = reinterpret_cast<u_charNameFunc>(::GetProcAddress(::LoadLibraryA("icuuc.dll"), "u_charName"));
if (!pfnU_charName)
return {};
int errorCode = 0;
std::array<char, 512> buffer;
int32_t length = pfnU_charName(codePoint, 0/*U_UNICODE_CHAR_NAME*/ , buffer.data(), static_cast<int32_t>(buffer.size() - 1), &errorCode);
if (errorCode != 0)
return {};
return std::string(buffer.data(), length);
}
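A minimal usage sketch for the wrapper above (assuming icuuc.dll is present, i.e. Windows 1709 or later):
#include <iostream>

int main()
{
    std::cout << GetUCharNameWrapper(U'q') << '\n';      // expected: LATIN SMALL LETTER Q
    std::cout << GetUCharNameWrapper(0x1F692) << '\n';   // expected: FIRE ENGINE
}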

How to split a string by emojis in C++

I'm trying to take a string of emojis and split them into a vector of each emoji. Given the string:
std::string emojis = "😀🔍🦑😁🔍🎉😂🤣";
I'm trying to get:
std::vector<std::string> splitted_emojis = {"😀", "🔍", "🦑", "😁", "🔍", "🎉", "😂", "🤣"};
Edit
I've tried to do:
std::string emojis = "😀🔍🦑😁🔍🎉😂🤣";
std::vector<std::string> splitted_emojis;
size_t pos = 0;
std::string token;
while ((pos = emojis.find("")) != std::string::npos)
{
token = emojis.substr(0, pos);
splitted_emojis.push_back(token);
emojis.erase(0, pos);
}
But after a couple of seconds it aborts with terminate called after throwing an instance of 'std::bad_alloc'.
When trying to check how many emojis are in a string using:
std::string emojis = "😀🔍🦑😁🔍🎉😂🤣";
std::cout << emojis.size() << std::endl; // returns 32
it returns a bigger number, which I assume is due to the Unicode encoding. I don't know much about Unicode, but I'm trying to figure out where the data of an emoji begins and ends so that I can split the string into individual emojis.
I would definitely recommend that you use a library with better Unicode support (all large frameworks do), but in a pinch you can get by with knowing that the UTF-8 encoding spreads Unicode characters over multiple bytes, and that the first bits of the first byte determine how many bytes a character is made up of.
I stole a function from boost. The split_by_codepoint function uses an iterator over the input string and constructs a new string using the first N bytes (where N is determined by the byte count function) and pushes it to the ret vector.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>
// Taken from boost internals
inline unsigned utf8_byte_count(uint8_t c)
{
// if the most significant bit with a zero in it is in position
// 8-N then there are N bytes in this UTF-8 sequence:
uint8_t mask = 0x80u;
unsigned result = 0;
while(c & mask)
{
++result;
mask >>= 1;
}
return (result == 0) ? 1 : ((result > 4) ? 4 : result);
}
std::vector<std::string> split_by_codepoint(std::string input) {
std::vector<std::string> ret;
auto it = input.cbegin();
while (it != input.cend()) {
uint8_t count = utf8_byte_count(*it);
ret.emplace_back(std::string{it, it+count});
it += count;
}
return ret;
}
int main() {
std::string emojis = u8"😀🔍🦑😁🔍🎉😂🤣";
auto split = split_by_codepoint(emojis);
std::cout << split.size() << std::endl;
}
Note that this function simply splits a string into UTF-8 strings containing one code point each. Determining if the character is an emoji is left as an exercise: UTF-8-decode any 4-byte characters and see if they are in the proper range.
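As a rough, hedged sketch of that exercise (the range check below only covers the common pictograph blocks; real emoji detection needs the Unicode emoji data files):
#include <cstdint>
#include <string>

// Decode the code point whose UTF-8 sequence starts at input[pos] and is
// `count` bytes long (as reported by utf8_byte_count above); assumes valid UTF-8.
char32_t decode_codepoint(const std::string& input, size_t pos, unsigned count)
{
    static const uint8_t lead_masks[] = { 0x7F, 0x1F, 0x0F, 0x07 };
    char32_t cp = static_cast<uint8_t>(input[pos]) & lead_masks[count - 1];
    for (unsigned k = 1; k < count; ++k)
        cp = (cp << 6) | (static_cast<uint8_t>(input[pos + k]) & 0x3F);
    return cp;
}

// Very rough test: U+1F300..U+1FAFF covers the main emoji/pictograph blocks.
bool is_probably_emoji(char32_t cp)
{
    return cp >= 0x1F300 && cp <= 0x1FAFF;
}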

How can I "convert" ISO-8859-7 strings to UTF-8 in C++?

I'm working with 10+ years old machines which use ISO 8859-7 to represent Greek characters using a single byte each.
I need to catch those characters and convert them to UTF-8 in order to inject them in a JSON to be sent via HTTPS.
Also, I'm using GCC v4.4.7 and I don't feel like upgrading, so I can't use codecvt or such.
Example: "OΛΑ":
I get char values [ 0xcf, 0xcb, 0xc1, ], I need to write this string "\u039F\u039B\u0391".
PS: I'm not a charset expert so please avoid philosophical answers like "ISO 8859 is a subset of Unicode so you just need to implement the algorithm".
Given that there are so few values to map, a simple solution is to use a lookup table.
Pseudocode:
id_offset = 0x80 // 0x00 .. 0x7F same in UTF-8
c1_offset = 0x20 // 0x80 .. 0x9F control characters
table_offset = id_offset + c1_offset
table = [
u8"\u00A0", // 0xA0
u8"‘", // 0xA1
u8"’",
u8"£",
u8"€",
u8"₯",
// ... Refer to ISO 8859-7 for full list of characters.
]
let S be the input string
let O be an empty output string
for each char C in S
reinterpret C as unsigned char U
if U less than id_offset // same in both encodings
append C to O
else if U less than table_offset // control code
append char '\xC2' to O // lead byte
append char C to O
else
append string table[U - table_offset] to O
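A minimal C++ rendering of that pseudocode, assuming the compiler encodes narrow string literals as UTF-8; the table below is truncated and would have to be completed from the ISO 8859-7 code chart before use:
#include <string>

// ISO 8859-7 -> UTF-8, following the pseudocode above.
std::string iso8859_7_to_utf8(const std::string& in)
{
    // One UTF-8 string per ISO 8859-7 value from 0xA0 upwards; only the first
    // few entries are shown, the rest must be filled in from the code chart.
    static const char* table[] = {
        "\u00A0", "‘", "’", "£", "€", "₯",
        // ...
    };
    std::string out;
    for (char c : in) {
        unsigned char u = static_cast<unsigned char>(c);
        if (u < 0x80)                  // ASCII: identical in UTF-8
            out += c;
        else if (u < 0xA0) {           // C1 control codes: 0xC2 lead byte + original byte
            out += '\xC2';
            out += c;
        } else                         // look up the precomputed UTF-8 sequence
            out += table[u - 0xA0];
    }
    return out;
}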
All that said, I recommend saving some time by using a library instead.
One way could be to use the Posix libiconv library. On Linux, the functions needed (iconv_open, iconv and iconv_close) are even included in libc so no extra linkage is needed there. On your old machines you may need to install libiconv but I doubt it.
Converting may be as simple as this:
#include <iconv.h>
#include <cerrno>
#include <cstring>
#include <iostream>
#include <iterator>
#include <stdexcept>
#include <string>
// A wrapper for the iconv functions
class Conv {
public:
// Open a conversion descriptor for the two selected character sets
Conv(const char* to, const char* from) : cd(iconv_open(to, from)) {
if(cd == reinterpret_cast<iconv_t>(-1))
throw std::runtime_error(std::strerror(errno));
}
Conv(const Conv&) = delete;
~Conv() { iconv_close(cd); }
// the actual conversion function
std::string convert(const std::string& in) {
const char* inbuf = in.c_str();
size_t inbytesleft = in.size();
// make the "out" buffer big to fit whatever we throw at it and set pointers
std::string out(inbytesleft * 6, '\0');
char* outbuf = out.data();
size_t outbytesleft = out.size();
// the const_cast shouldn't be needed but my "iconv" function declares it
// "char**" not "const char**"
size_t non_rev_converted = iconv(cd, const_cast<char**>(&inbuf),
&inbytesleft, &outbuf, &outbytesleft);
if(non_rev_converted == static_cast<size_t>(-1)) {
// here you can add misc handling like replacing erroneous chars
// and continue converting etc.
// I'll just throw...
throw std::runtime_error(std::strerror(errno));
}
// shrink to keep only what we converted
out.resize(outbuf - out.data());
return out;
}
private:
iconv_t cd;
};
int main() {
Conv cvt("UTF-8", "ISO-8859-7");
// create a string from the ISO-8859-7 data
unsigned char data[]{0xcf, 0xcb, 0xc1};
std::string iso88597_str(std::begin(data), std::end(data));
auto utf8 = cvt.convert(iso88597_str);
std::cout << utf8 << '\n';
}
Output (in UTF-8):
ΟΛΑ
Using this you can create a mapping table, from ISO-8859-7 to UTF-8, that you include in your project instead of iconv:
Demo
Ok, I decided to do this myself instead of looking for a compatible library. Here's how I did it.
The main problem was figuring out how to fill the two bytes for Unicode using the single one for ISO, so I used the debugger to read the value for the same character, first written by the old machine and then written with a constant string (UTF-8 by default). I started with "O" and "Π" and saw that in UTF-8 the first byte was always 0xCE while the second one was filled with the ISO value plus an offset (-0x30). I built the following code to implement this and used a test string filled with all Greek letters, both upper and lower case. Then I realised that starting from "π" (0xF0 in ISO) both the first byte and the offset for the second one changed, so I added a test to figure out which of the two rules to apply.
The following method returns a bool to let the caller know whether the original string contained ISO characters (useful for other purposes) and overwrites the original string, passed by pointer, with the new one. I worked with char arrays instead of strings for coherence with the rest of the project, which is basically a C project written in C++.
bool iso_to_utf8(char* in){
bool wasISO=false;
if(in == NULL)
return wasISO;
// count chars
int i=strlen(in);
if(!i)
return wasISO;
// create and size the new buffer (worst case: every char expands to 2 bytes, plus terminator)
char *out = new char[2*i + 1];
// fill with 0's, useful for watching the string as it gets built
memset(out, 0, 2*i + 1);
// ready to start from head of old buffer
i=0;
// index for new buffer
int j=0;
// for each char in old buffer
while(in[i]!='\0'){
if(in[i] >= 0){
// it's already utf8-compliant, take it as it is
out[j++] = in[i];
}else{
// it's ISO
wasISO=true;
// get plain value
int val = in[i] & 0xFF;
// first byte to CF or CE
out[j++]= val > 0xEF ? 0xCF : 0xCE;
// second char to plain value normalized
out[j++] = val - (val > 0xEF ? 0x70 : 0x30);
}
i++;
}
// add string terminator
out[j]='\0';
// paste into the old char array (the caller's buffer must be large enough for the expanded string)
strcpy(in, out);
delete[] out;
return wasISO;
}
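A hedged usage sketch; since iso_to_utf8() writes the converted text back into the buffer you pass in, that buffer must already be large enough for the worst-case doubling:
#include <cstdio>

int main()
{
    // ISO 8859-7 bytes for "ΟΛΑ", with spare room for the UTF-8 expansion
    char buf[16] = { '\xCF', '\xCB', '\xC1', '\0' };
    bool wasISO = iso_to_utf8(buf);
    printf("%s (wasISO=%d)\n", buf, wasISO);   // expected: ΟΛΑ (wasISO=1)
}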

How to print each character of strings that mix ASCII characters with Unicode?

For example, I want to create a typewriter effect, so I need to print strings like this:
#include <string>
int main(){
std::string st1="ab》cd《ef";
for(int i=0;i<st1.size();i++){
std::string st2=st1.substr(0,i).c_str();
printf("%s\n",st2.c_str());
}
return 0;
}
but the output is
a
ab
ab?
ab?
ab》
ab》c
ab》cd
ab》cd?
ab》cd?
ab》cd《
ab》cd《e
and not:
a
ab
ab》
ab》c
ab》cd
ab》cd《
ab》cd《e
How do I know whether the upcoming character is a multi-byte Unicode character?
A similar question: printing each character also has the problem:
#include <string>
int main(){
std::string st1="ab》cd《ef";
for(int i=0;i<st1.size();i++){
std::string st2=st1.substr(i,1).c_str();
printf("%s\n",st2.c_str());
}
return 0;
}
the output is:
a
b
?
?
?
c
d
?
?
?
e
f
not:
a
b
》
c
d
《
e
f
I think the problem is the encoding. Your string is likely in UTF-8, which has variable-sized characters. This means you cannot iterate one char at a time, because some characters are more than one char wide.
The fact is, in Unicode, you can only reliably iterate one fixed-size character at a time with the UTF-32 encoding.
So what you can do is use a UTF library like ICU to convert between UTF-8 and UTF-32.
If you have C++11 then there are some tools to help you here, mostly std::u32string which is able to hold UTF-32 encoded strings:
#include <string>
#include <iostream>
#include <unicode/ucnv.h>
#include <unicode/uchar.h>
#include <unicode/utypes.h>
// convert from UTF-32 to UTF-8
std::string to_utf8(std::u32string s)
{
UErrorCode status = U_ZERO_ERROR;
char target[1024];
int32_t len = ucnv_convert(
"UTF-8", "UTF-32"
, target, sizeof(target)
, (const char*)s.data(), s.size() * sizeof(char32_t)
, &status);
return std::string(target, len);
}
// convert from UTF-8 to UTF-32
std::u32string to_utf32(const std::string& utf8)
{
UErrorCode status = U_ZERO_ERROR;
char32_t target[256];
int32_t len = ucnv_convert(
"UTF-32", "UTF-8"
, (char*)target, sizeof(target)
, utf8.data(), utf8.size()
, &status);
return std::u32string(target, (len / sizeof(char32_t)));
}
int main()
{
// UTF-8 input (needs UTF-8 editor)
std::string utf8 = "ab》cd《ef"; // UTF-8
// convert to UTF-32
std::u32string utf32 = to_utf32(utf8);
// Now it is safe to use string indexing
// But i is for length so starting from 1
for(std::size_t i = 1; i < utf32.size(); ++i)
{
// convert back to to UTF-8 for output
// NOTE: i + 1 to include the BOM
std::cout << to_utf8(utf32.substr(0, i + 1)) << '\n';
}
}
Output:
a
ab
ab》
ab》c
ab》cd
ab》cd《
ab》cd《e
ab》cd《ef
NOTE:
The ICU library adds a BOM (Byte Order Mark) at the beginning of the strings it converts into Unicode. Therefore you need to deal with the fact that the first character of the UTF-32 string is the BOM. This is why the substring uses i + 1 for its length parameter to include the BOM.
Your C++ code is simply echoing octets to your terminal, and it is your terminal display that's converting octets encoded in its default character set to Unicode characters.
It looks like, based on your example, that your terminal display uses UTF-8. The rules for converting UTF-8-encoded characters to Unicode are fairly well specified (Google is your friend), so all you have to do is check the first byte of a UTF-8 sequence to figure out how many octets make up the next Unicode character.
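As a sketch of that approach (assuming the source file and the terminal both use UTF-8), the loop can advance by the length of each UTF-8 sequence instead of one char at a time:
#include <cstdio>
#include <string>

// Length of the UTF-8 sequence that starts with lead byte c
// (only valid when c really is a lead byte, not a continuation byte).
static size_t sequenceLength(unsigned char c)
{
    if (c < 0x80) return 1;   // 0xxxxxxx
    if (c < 0xE0) return 2;   // 110xxxxx
    if (c < 0xF0) return 3;   // 1110xxxx
    return 4;                 // 11110xxx
}

int main()
{
    std::string st1 = "ab》cd《ef";
    size_t i = 0;
    while (i < st1.size())
    {
        i += sequenceLength(static_cast<unsigned char>(st1[i]));
        // i now sits on a character boundary, so the prefix is valid UTF-8
        printf("%s\n", st1.substr(0, i).c_str());
    }
    return 0;
}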

C++ substring multi byte characters

I have a std::string which contains some characters that span multiple bytes.
When I do a substring on this string, the output is not valid because, of course, these characters are counted as two chars each. In my opinion I should be using a wstring instead, because it will store these characters in one element instead of two.
So I decided to copy the string into a wstring, but of course this does not make sense, because the characters remain split over two elements. This only makes it worse.
Is there a good solution for converting a string to a wstring, merging the special characters into one element instead of two?
Thanks
Simpler version, based on the solution provided in Getting the actual length of a UTF-8 encoded std::string? by Marcelo Cantos:
std::string substr(std::string originalString, int maxLength)
{
std::string resultString = originalString;
int len = 0;
int byteCount = 0;
const char* aStr = originalString.c_str();
while(*aStr)
{
if( (*aStr & 0xc0) != 0x80 )
len += 1;
if(len>maxLength)
{
resultString = resultString.substr(0, byteCount);
break;
}
byteCount++;
aStr++;
}
return resultString;
}
A std::string object is not a string of characters, it's a string of bytes. It has no notion of what's called "encoding" at all. The same goes for std::wstring, except that it's a string of wchar_t values (16-bit on Windows).
In order to perform operations on your text which require addressing distinct characters (as is the case when you want to take a substring, for instance), you need to know what encoding is used for your std::string object.
UPDATE: Now that you clarified that your input string is UTF-8 encoded, you still need to decide on an encoding to use for your output std::wstring. UTF-16 comes to mind, but it really depends on what the APIs to which you will pass the std::wstring objects expect. Assuming that UTF-16 is acceptable you have various choices:
On Windows, you can use the MultiByteToWideChar function; no extra dependencies required (see the sketch after this list).
The UTF8-CPP library claims to provide a lightweight solution for dealing with UTF-* encoded strings. Never tried it myself, but I keep hearing good things about it.
On Linux systems, using the libiconv library is quite common.
If you need to deal with all sorts of crazy encodings and want the full-blown alpha-and-omega word as far as encodings go, look at ICU.
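A hedged sketch of the first option, Windows only and with error handling kept to a minimum:
#include <string>
#include <windows.h>

// Convert a UTF-8 encoded std::string to a UTF-16 encoded std::wstring.
std::wstring utf8_to_wstring(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    // First call asks for the required length, second call does the conversion.
    int len = ::MultiByteToWideChar(CP_UTF8, 0, utf8.data(), static_cast<int>(utf8.size()), nullptr, 0);
    std::wstring result(static_cast<size_t>(len), L'\0');
    ::MultiByteToWideChar(CP_UTF8, 0, utf8.data(), static_cast<int>(utf8.size()), &result[0], len);
    return result;
}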
There are really only two possible solutions. If you're doing this a lot, over large distances, you'd be better off converting your characters to a single-element encoding, using wchar_t (or int32_t, or whatever is most appropriate). This is not a simple copy, which would convert each individual char into the target type, but a true conversion function, which would recognize the multibyte characters and convert them into a single element.
For occasional use or shorter sequences, it's possible to write your own functions for advancing n bytes. For UTF-8, I use the following:
// size( ch ) returns the number of bytes in the UTF-8 sequence whose lead byte
// is ch; Byte is a typedef for unsigned char and byteCountTable is a 256-entry
// lookup table mapping a lead byte to its sequence length (both defined
// elsewhere in my code).
inline size_t
size(
Byte ch )
{
return byteCountTable[ ch ] ;
}
template< typename InputIterator >
InputIterator
succ(
InputIterator begin,
size_t size,
std::random_access_iterator_tag )
{
return begin + size ;
}
template< typename InputIterator >
InputIterator
succ(
InputIterator begin,
size_t size,
std::input_iterator_tag )
{
while ( size != 0 ) {
++ begin ;
-- size ;
}
return begin ;
}
template< typename InputIterator >
InputIterator
succ(
InputIterator begin,
InputIterator end )
{
if ( begin != end ) {
begin = succ( begin, size( *begin ),
typename std::iterator_traits< InputIterator >::iterator_category() ) ;
}
return begin ;
}
template< typename InputIterator >
size_t
characterCount(
InputIterator begin,
InputIterator end )
{
size_t result = 0 ;
while ( begin != end ) {
++ result ;
begin = succ( begin, end ) ;
}
return result ;
}
Based on this I've written my utf8 substring function:
void utf8substr(std::string originalString, int SubStrLength, std::string& csSubstring)
{
int len = 0, byteIndex = 0;
const char* aStr = originalString.c_str();
size_t origSize = originalString.size();
for (byteIndex=0; byteIndex < origSize; byteIndex++)
{
if((aStr[byteIndex] & 0xc0) != 0x80)
len += 1;
if(len > SubStrLength) // stop at the first byte past the requested number of characters
break;
}
csSubstring = originalString.substr(0, byteIndex);
}
Unicode is hard.
1) std::wstring is not a list of code points, it's a list of wchar_t, and their width is implementation-defined (commonly 16 bits with VC++ and 32 bits with gcc and clang). Yes, it means it's useless for portable code...
2) A single character may be encoded as several code points (because of diacritics).
3) In some languages, two different characters together form a "unit" that is not really separable (for example, LL is considered a letter of its own in Spanish).
So... it's a bit hard.
Solving 3) may be costly (it requires specific language/usage annotations); solving 1) and 2) is absolutely necessary... and requires Unicode-aware libraries or coding your own (and probably getting it wrong).
1) is trivially solved: writing a routine transforming UTF-8 to code points is trivial (a code point can be represented with a uint32_t).
2) is more difficult: it requires a list of diacritics, and the subroutine must know never to cut just before a diacritic (they follow the character they qualify).
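To make the claim in 1) concrete, here is a minimal, hedged sketch of such a routine; it assumes well-formed UTF-8 and performs no validation:
#include <cstdint>
#include <string>
#include <vector>

// Decode a UTF-8 string into a sequence of code points (no validation).
std::vector<uint32_t> toCodePoints(const std::string& utf8)
{
    std::vector<uint32_t> cps;
    for (size_t i = 0; i < utf8.size(); )
    {
        uint8_t lead = static_cast<uint8_t>(utf8[i]);
        size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        uint32_t cp = lead;
        if (len > 1)
        {
            cp = lead & (0xFF >> (len + 1));          // keep the payload bits of the lead byte
            for (size_t k = 1; k < len; ++k)          // append 6 payload bits per continuation byte
                cp = (cp << 6) | (static_cast<uint8_t>(utf8[i + k]) & 0x3F);
        }
        cps.push_back(cp);
        i += len;
    }
    return cps;
}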
Otherwise, there is probably what you seek in ICU. I wish you good luck finding it.
Let me assume for simplicity that your encoding is UTF-8. In this case we would have some chars occupying more than one byte, as in your case.
Then you have std::string, where those UTF-8 encoded characters are stored.
And now you want to substr() in terms of chars, not bytes.
I'd write a function that converts a character count to a byte count. For the UTF-8 case it would look like this:
#define UTF8_CHAR_LEN( byte ) (( 0xE5000000 >> (( byte >> 3 ) & 0x1e )) & 3 ) + 1
int GetByteCountForCharCount(const char* utf8Str, int charCnt)
{
int ByteCount = 0;
for (int i = 0; i < charCnt; i++)
{
int charlen = UTF8_CHAR_LEN(*utf8Str);
ByteCount += charlen;
utf8Str += charlen;
}
return ByteCount;
}
So, say you want to substr() the string from 7-th char. No problem:
int pos = GetByteCountForCharCount(str.c_str(), 7);
str.substr(pos);