How to split a string by emojis in C++

How to split a string by emojis in C++ - c++

I'm trying to take a string of emojis and split them into a vector of each emoji Given the string:
std::string emojis = "😀🔍🦑😁🔍🎉😂🤣";
I'm trying to get:
std::vector<std::string> splitted_emojis = {"😀", "🔍", "🦑", "😁", "🔍", "🎉", "😂", "🤣"};
Edit
I've tried to do:
std::string emojis = "😀🔍🦑😁🔍🎉😂🤣";
std::vector<std::string> splitted_emojis;
size_t pos = 0;
std::string token;
while ((pos = emojis.find("")) != std::string::npos)
{
token = emojis.substr(0, pos);
splitted_emojis.push_back(token);
emojis.erase(0, pos);
}
But it seems like it throws terminate called after throwing an instance of 'std::bad_alloc' after a couple of seconds.
When trying to check how many emojis are in a string using:
std::string emojis = "😀🔍🦑😁🔍🎉😂🤣";
std::cout << emojis.size() << std::endl; // returns 32
it returns a bigger number which i assume are the unicode data. I don't know too much about unicode data but i'm trying to figure out how to check for when the data of an emoji begins and ends to be able to split the string to each emoji

I would definitely recommend that you use a library with better unicode support (all large frameworks do), but in a pinch you can get by with knowing that the UTF-8 encoding spreads Unicode characters over multiple bytes, and that the first bits of the first byte determine how many bytes a character is made up of.
I stole a function from boost. The split_by_codepoint function uses an iterator over the input string and constructs a new string using the first N bytes (where N is determined by the byte count function) and pushes it to the ret vector.
// Taken from boost internals
inline unsigned utf8_byte_count(uint8_t c)
{
// if the most significant bit with a zero in it is in position
// 8-N then there are N bytes in this UTF-8 sequence:
uint8_t mask = 0x80u;
unsigned result = 0;
while(c & mask)
{
++result;
mask >>= 1;
}
return (result == 0) ? 1 : ((result > 4) ? 4 : result);
}
std::vector<std::string> split_by_codepoint(std::string input) {
std::vector<std::string> ret;
auto it = input.cbegin();
while (it != input.cend()) {
uint8_t count = utf8_byte_count(*it);
ret.emplace_back(std::string{it, it+count});
it += count;
}
return ret;
}
int main() {
std::string emojis = u8"😀🔍🦑😁🔍🎉😂🤣";
auto split = split_by_codepoint(emojis);
std::cout << split.size() << std::endl;
}
Note that this function simply splits a string into UTF-8 strings containing one code point each. Determining if the character is an emoji is left as an exercise: UTF-8-decode any 4-byte characters and see if they are in the proper range.

Related

Generate string by switching ASCII code (extended)

I have a list of strings that were "encrypted" with a simple char code switching. To make them readable I need to increase each character code by 7. The problem is that those codes can be above 127 (extended ASCII). I'm iterating each seed string and trying to produce a new string like this
string result = "";
for(std::string::iterator it = myEncryptedWord.begin(); it != myEncryptedWord.end(); ++it)
{
unsigned char myC = *it;
result+= char(myC+7);
}
It does not work for codes above 127 (characters like áóõã...). Needless to say that I can't control the original string list and change this poor "encryption".

How to get the name of a Unicode character?

I think I saw this a long time ago; a way to get a string containing the name of a unicode character by using Win32 API calls. I'm using C++ Builder so if there is support for it in the VCL library that would work fine too.
For example:
GetUnicodeName(U+0021) would return a string (or fill in a struct or similar), such as "EXCLAMATION MARK".
Or if there are some other way to get the same result from Windows with C or C++.
The worst case scenario would be to have a HUGE lookup table with the names of interest (mainly Latin characters).

You can use undocumented GetUName method from getuname.dll:
std::string GetUnicodeCharacterName(wchar_t character)
{
// https://github.com/reactos/reactos/tree/master/dll/win32/getuname
typedef int(WINAPI* GetUNameFunc)(WORD wCharCode, LPWSTR lpBuf);
static GetUNameFunc pfnGetUName = reinterpret_cast<GetUNameFunc>(::GetProcAddress(::LoadLibraryA("getuname.dll"), "GetUName"));
if (!pfnGetUName)
return {};
std::array<WCHAR, 256> buffer;
int length = pfnGetUName(character, buffer.data());
return utf8::narrow(buffer.data(), length);
}
// Replace invisible code point with code point that is visible
wchar_t ReplaceInvisible(wchar_t character)
{
if (!std::iswgraph(character))
{
if (character <= 0x21)
character += 0x2400; // U+2400 Control Pictures https://www.unicode.org/charts/PDF/U2400.pdf
else
character = 0xFFFD; // REPLACEMENT CHARACTER
}
return character;
}
// Accepts in UTF-8.
// Returns UTF-8 string like this:
// q <U+71 Latin Small Letter Q>
// п <U+43F Cyrillic Small Letter Pe>
// ␈ <U+8 Backspace>
// 𐌸 <U+10338 Supplementary Multilingual Plane>
// 🚒 <U+1F692 Supplementary Multilingual Plane>
std::string GetUnicodeCharacterNames(std::string string)
{
// UTF-8 <=> UTF-32 converter
std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> utf32conv;
// UTF-8 to UTF-32
std::u32string utf32string = utf32conv.from_bytes(string);
std::string characterNames;
characterNames.reserve(35 * utf32string.size());
for (const char32_t& codePoint : utf32string)
{
if (!characterNames.empty())
characterNames.append(", ");
char32_t visibleCodePoint = (codePoint < 0xFFFF) ? ReplaceInvisible(static_cast<wchar_t>(codePoint)) : codePoint;
std::string charName = (codePoint < 0xFFFF) ? GetUnicodeCharacterName(static_cast<wchar_t>(codePoint)) : "Supplementary Multilingual Plane";
// UTF-32 to UTF-8
std::string utf8codePoint = utf32conv.to_bytes(&visibleCodePoint, &visibleCodePoint + 1);
characterNames.append(fmt::format("{} <U+{:X} {}>", utf8codePoint, static_cast<uint32_t>(codePoint), charName));
}
return characterNames;
}
The downside is that it only contains characters from Unicode Basic Multilingual Plane (BMP).
Update: You can use u_charName() ICU API that comes with Windows since Fall Creators Update (Version 1709 Build 16299):
std::string GetUCharNameWrapper(char32_t codePoint)
{
typedef int32_t(*u_charNameFunc)(char32_t code, int nameChoice, char* buffer, int32_t bufferLength, int* pErrorCode);
static u_charNameFunc pfnU_charName = reinterpret_cast<u_charNameFunc>(::GetProcAddress(::LoadLibraryA("icuuc.dll"), "u_charName"));
if (!pfnU_charName)
return {};
int errorCode = 0;
std::array<char, 512> buffer;
int32_t length = pfnU_charName(codePoint, 0/*U_UNICODE_CHAR_NAME*/ , buffer.data(), static_cast<int32_t>(buffer.size() - 1), &errorCode);
if (errorCode != 0)
return {};
return std::string(buffer.data(), length);
}

Lexicographical sorting for non-ascii characters

I have done lexicographical sorting for ascii characters by the following code:
std::ifstream infile;
std::string line, new_line;
std::vector<std::string> v;
while(std::getline(infile, line))
{
// If line is empty, ignore it
if(line.empty())
continue;
new_line = line + "\n";
// Line contains string of length > 0 then save it in vector
if(new_line.size() > 0)
v.push_back(new_line);
}
sort(v.begin(), v.end());
The result should be:
a
aahr
abyutrw
bb
bhehjr
cgh
cuttrew
....
But I don't know how to do Lexicographical sorting for both ascii and non-ascii characters in the order like this: a A À Á Ã brg Baq ckrwg CkfgF d Dgrn... Please tell me how to write code for it. Thank you!

The OP didn't but I find it worth to mention: Speaking about non-ASCII characters, the encoding should be considered as well.
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Characters like À, Á, and Â are not part of the 7 bit ASCII but were considered in a variety of 8 bit encodings like e.g. Windows 1252. Thereby, it's not granted that a certain character (which is not part of ASCII) has the same code point (i.e. number) in any encoding. (Most of the characters have no number in most encodings.)
However, a unique encoding table is provided by the Unicode containing all characters of any other encoding (I believe). There are implementations as
UTF-8 where code points are represented by 1 or more 8 bit values (storage with char)
UTF-16 where code points are represented with 1 or 2 16 bit values (storage with std::char16_t or, maybe, wchar_t)
UTF-32 where code points are represented with 1 32 bit value (storage with std::char32_t or, maybe, wchar_t if it has sufficient size).
Concerning the size of wchar_t: Character types.
Having that said, I used wchar_t and std::wstring in my sample to make the usage of umlauts locale and platform independent.
The order used in std::sort() to sort a range of T elements is defined by default withbool < operator(const T&, const T&) the < operator for T.
However, there are flavors of std::sort() to define a custom predicate instead.
The custom predicate must match the signature and must provide a strict weak ordering relation.
Hence, my recommendation to use a std::map which maps the charactes to an index which results in the intended order.
This is the predicate, I used in my sample:
// sort words
auto charIndex = [&mapChars](wchar_t chr)
{
const CharMap::const_iterator iter = mapChars.find(chr);
return iter != mapChars.end()
? iter->second
: (CharMap::mapped_type)mapChars.size();
};
auto pred
= [&mapChars, &charIndex](const std::wstring &word1, const std::wstring &word2)
{
const size_t len = std::min(word1.size(), word2.size());
// + 1 to include zero terminator
for (size_t i = 0; i < len; ++i) {
const wchar_t chr1 = word1[i], chr2 = word2[i];
const unsigned i1 = charIndex(chr1), i2 = charIndex(chr2);
if (i1 != i2) return i1 < i2;
}
return word1.size() < word2.size();
};
std::sort(words.begin(), words.end(), pred);
From bottom to top:
std::sort(words.begin(), words.end(), pred); is called with a third parameter which provides the predicate pred for my customized order.
The lambda pred(), compares two std::wstrings character by character.
Thereby, the comparison is done using a std::map mapChars which maps wchar_t to unsigned i.e. a character to its rank in my order.
The mapChars stores only a selection of all character values. Hence, the character in quest might not be found in the mapChars. To handle this, a helper lambda charIndex() is used which returns mapChars.size() in this case – which is granted to be higher than all occurring indices.
The type CharMap is simply a typedef:
typedef std::map<wchar_t, unsigned> CharMap;
To initialize a CharMap, a function is used:
CharMap makeCharMap(const wchar_t *table[], size_t size)
{
CharMap mapChars;
unsigned rank = 0;
for (const wchar_t **chars = table; chars != table + size; ++chars) {
for (const wchar_t *chr = *chars; *chr; ++chr) mapChars[*chr] = rank;
++rank;
}
return mapChars;
}
It has to be called with an array of strings which contains all groups of characters in the intended order:
const wchar_t *table[] = {
L"aA", L"äÄ", L"bB", L"cC", L"dD", L"eE", L"fF", L"gG", L"hH", L"iI", L"jJ", L"kK", L"lL", L"mM", L"nN",
L"oO", L"öÖ", L"pP", L"qQ", L"rR", L"sS", L"tT", L"uU", L"üÜ", L"vV", L"wW", L"xX", L"yY", L"zZ"
};
The complete sample:
#include <string>
#include <sstream>
#include <vector>
static const wchar_t *table[] = {
L"aA", L"äÄ", L"bB", L"cC", L"dD", L"eE", L"fF", L"gG", L"hH", L"iI", L"jJ", L"kK", L"lL", L"mM", L"nN",
L"oO", L"öÖ", L"pP", L"qQ", L"rR", L"sS", L"tT", L"uU", L"üÜ", L"vV", L"wW", L"xX", L"yY", L"zZ"
};
static const wchar_t *tableGerman[] = {
L"aAäÄ", L"bB", L"cC", L"dD", L"eE", L"fF", L"gG", L"hH", L"iI", L"jJ", L"kK", L"lL", L"mM", L"nN",
L"oOöÖ", L"pP", L"qQ", L"rR", L"sS", L"tT", L"uUüÜ", L"vV", L"wW", L"xX", L"yY", L"zZ"
};
typedef std::map<wchar_t, unsigned> CharMap;
// fill a look-up table to map characters to the corresponding rank
CharMap makeCharMap(const wchar_t *table[], size_t size)
{
CharMap mapChars;
unsigned rank = 0;
for (const wchar_t **chars = table; chars != table + size; ++chars) {
for (const wchar_t *chr = *chars; *chr; ++chr) mapChars[*chr] = rank;
++rank;
}
return mapChars;
}
// conversion to UTF-8 found in https://stackoverflow.com/a/7561991/7478597
// needed to print to console
// Please, note: std::codecvt_utf8() is deprecated in C++17. :-(
std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8_conv;
// collect words and sort accoring to table
void printWordsSorted(
const std::wstring &text, const wchar_t *table[], const size_t size)
{
// make look-up table
const CharMap mapChars = makeCharMap(table, size);
// strip punctuation and other noise
std::wstring textClean;
for (const wchar_t chr : text) {
if (chr == ' ' || mapChars.find(chr) != mapChars.end()) {
textClean += chr;
}
}
// fill word list with sample text
std::vector<std::wstring> words;
for (std::wistringstream in(textClean);;) {
std::wstring word;
if (!(in >> word)) break; // bail out
// store word
words.push_back(word);
}
// sort words
auto charIndex = [&mapChars](wchar_t chr)
{
const CharMap::const_iterator iter = mapChars.find(chr);
return iter != mapChars.end()
? iter->second
: (CharMap::mapped_type)mapChars.size();
};
auto pred
= [&mapChars, &charIndex](const std::wstring &word1, const std::wstring &word2)
{
const size_t len = std::min(word1.size(), word2.size());
// + 1 to include zero terminator
for (size_t i = 0; i < len; ++i) {
const wchar_t chr1 = word1[i], chr2 = word2[i];
const unsigned i1 = charIndex(chr1), i2 = charIndex(chr2);
if (i1 != i2) return i1 < i2;
}
return word1.size() < word2.size();
};
std::sort(words.begin(), words.end(), pred);
// remove duplicates
std::vector<std::wstring>::iterator last = std::unique(words.begin(), words.end());
words.erase(last, words.end());
// print result
for (const std::wstring &word : words) {
std::cout << utf8_conv.to_bytes(word) << '\n';
}
}
template<typename T, size_t N>
size_t size(const T (&arr)[N]) { return sizeof arr / sizeof *arr; }
int main()
{
// a sample string
std::wstring sampleText
= L"In the German language the ä (a umlaut), ö (o umlaut) and ü (u umlaut)"
L" have the same lexicographical rank as their counterparts a, o, and u.\n";
std::cout << "Sample text:\n"
<< utf8_conv.to_bytes(sampleText) << '\n';
// sort like requested by OP
std::cout << "Words of text sorted as requested by OP:\n";
printWordsSorted(sampleText, table, size(table));
// sort like correct in German
std::cout << "Words of text sorted as usual in German language:\n";
printWordsSorted(sampleText, tableGerman, size(tableGerman));
}
Output:
Words of text sorted as requested by OP:
a
and
as
ä
counterparts
German
have
In
language
lexicographical
o
ö
rank
same
the
their
u
umlaut
ü
Words of text sorted as usual in German language:
ä
a
and
as
counterparts
German
have
In
language
lexicographical
o
ö
rank
same
the
their
u
ü
umlaut
Live Demo on coliru
Note:
My original intention was to do the output with std::wcout. This didn't work correctly for ä, ö, ü. Hence, I looked up a simple way to convert wstrings to UTF-8. I already knew that UTF-8 is supported in coliru.
#Phil1970 reminded me that I forgot to mention something else:
Sorting of strings (according to “human dictionary” order) is usually provided by std::locale. std::collate provides a locale dependent lexicographical ordering of strings.
The locale plays a role because the order of characters might vary with distinct locales. The std::collate doc. has a nice example for this:
Default locale collation order: Zebra ar förnamn zebra ängel år ögrupp
English locale collation order: ängel ar år förnamn ögrupp zebra Zebra
Swedish locale collation order: ar förnamn zebra Zebra år ängel ögrupp
Conversion of UTF-16 ⇔ UTF-32 ⇔ UTF-8 can be achieved by mere bit-arithmetics. For conversion to/from any other encoding (ASCII excluded which is a subset of Unicode), I would recommend a library like e.g. libiconv.

How to remove the last character of a UTF-8 string in C++?

The text is stored in a std::string.
If the text is 8-bit ASCII, then it is really easy:
text.pop_back();
But what if it is UTF-8 text?
As far as I know, there are no UTF-8 related functions in the standard library which I could use.

You really need a UTF-8 Library if you are going to work with UTF-8. However for this task I think something like this may suffice:
void pop_back_utf8(std::string& utf8)
{
if(utf8.empty())
return;
auto cp = utf8.data() + utf8.size();
while(--cp >= utf8.data() && ((*cp & 0b10000000) && !(*cp & 0b01000000))) {}
if(cp >= utf8.data())
utf8.resize(cp - utf8.data());
}
int main()
{
std::string s = "κόσμε";
while(!s.empty())
{
std::cout << s << '\n';
pop_back_utf8(s);
}
}
Output:
κόσμε
κόσμ
κόσ
κό
κ
It relies on the fact that UTF-8 Encoding has one start byte followed by several continuation bytes. Those continuation bytes can be detected using the provided bitwise operators.

What you can do is pop off characters until you reach the leading byte of a code point. The leading byte of a code point in UTF8 is either of the pattern 0xxxxxxx or 11xxxxxx, and all non-leading bytes are of the form 10xxxxxx. This means you can check the first and second bit to determine if you have a leading byte.
bool is_leading_utf8_byte(char c) {
auto first_bit_set = (c & 0x80) != 0;
auto second_bit_set = (c & 0X40) != 0;
return !first_bit_set || second_bit_set;
}
void pop_utf8(std::string& x) {
while (!is_leading_utf8_byte(x.back()))
x.pop_back();
x.pop_back();
}
This of course does no error checking and assumes that your string is valid utf-8.

Splitting a char array by delimeter, then saving the result?

I need to be able to parse the following two strings in my program:
cat myfile || sort
more myfile || grep DeKalb
The string is being saved in char buffer[1024]. What I need to end up with is a pointer to a char array for the left side, and a pointer to a char array for the right side so that I can use these to call the following for each side:
int execvp(const char *file, char *const argv[]);
Anyone have any ideas as to how I can get the right arguments for the execvp command if the two strings above are saved in a character buffer char buffer[1024]; ?
I need char *left to hold the first word of the left side, then char *const leftArgv[] to hold both words on the left side. Then I need the same thing for the right. I have been messing around with strtok for like two hours now and I am hitting a wall. Anyone have any ideas?

I recommend you to learn more about regular expressions. And in order to solve your problem painlessly, you could utilize the Boost.Regex library which provides a powerful regular expression engine. The solution would be just several lines of code, but I encourage you to do it yourself - that would be a good exercise. If you still have problems, come back with some results and clearly state where you were stuck.

You could use std::getline(stream, stringToReadInto, delimeter).
I personally use my own function, which has some addition features baked into it, that looks like this:
StringList Seperate(const std::string &str, char divider, SeperationFlags seperationFlags, CharValidatorFunc whitespaceFunc)
{
return Seperate(str, CV_IS(divider), seperationFlags, whitespaceFunc);
}
StringList Seperate(const std::string &str, CharValidatorFunc isDividerFunc, SeperationFlags seperationFlags, CharValidatorFunc whitespaceFunc)
{
bool keepEmptySegments = (seperationFlags & String::KeepEmptySegments);
bool keepWhitespacePadding = (seperationFlags & String::KeepWhitespacePadding);
StringList stringList;
size_t startOfSegment = 0;
for(size_t pos = 0; pos < str.size(); pos++)
{
if(isDividerFunc(str[pos]))
{
//Grab the past segment.
std::string segment = str.substr(startOfSegment, (pos - startOfSegment));
if(!keepWhitespacePadding)
{
segment = String::RemovePadding(segment);
}
if(keepEmptySegments || !segment.empty())
{
stringList.push_back(segment);
}
//If we aren't keeping empty segments, speedily check for multiple seperators in a row.
if(!keepEmptySegments)
{
//Keep looping until we don't find a divider.
do
{
//Increment and mark this as the (potential) beginning of a new segment.
startOfSegment = ++pos;
//Check if we've reached the end of the string.
if(pos >= str.size())
{
break;
}
}
while(isDividerFunc(str[pos]));
}
else
{
//Mark the beginning of a new segment.
startOfSegment = (pos + 1);
}
}
}
//The final segment.
std::string lastSegment = str.substr(startOfSegment, (str.size() - startOfSegment));
if(keepEmptySegments || !lastSegment.empty())
{
stringList.push_back(lastSegment);
}
return stringList;
}
Where 'StringList' is a typedef of std::vector, and CharValidatorFunc is a function pointer (actually, std::function to allow functor and lambda support) for a function taking one char, and returning a bool. it can be used like so:
StringList results = String::Seperate(" Meow meow , Green, \t\t\nblue\n \n, Kitties!", ',' /* delimeter */, DefaultFlags, is_whitespace);
And would return the results:
{"Meow meow", "Green", "blue", "Kitties!"}
Preserving the internal whitespace of 'Meow meow', but removing the spaces and tabs and newlines surrounding the variables, and splitting upon commas.
(CV_IS is a functor object for matching a specific char or a specific collection of chars taken as a string-literal. I also have CV_AND and CV_OR for combining char validator functions)
For a string literal, I'd just toss it into a std::string() and then pass it to the function, unless extreme performance is required. Breaking on delimeters is fairly easy to roll your own - the above function is just customized to my projects' typical usage and requirements, but feel free to modify it and claim it for yourself.

In case this gives anyone else grief, this is how I solved the problem:
//variables for the input and arguments
char *command[2];
char *ptr;
char *LeftArg[3];
char *RightArg[3];
char buf[1024]; //input buffer
//parse left and right of the ||
number = 0;
command[0] = strtok(buf, "||");
//split left and right
while((ptr=strtok(NULL, "||")) != NULL)
{
number++;
command[number]=ptr;
}
//parse the spaces out of the left side
number = 0;
LeftArg[0] = strtok(command[0], " ");
//split the arguments
while((ptr=strtok(NULL, " ")) != NULL)
{
number++;
LeftArg[number]=ptr;
}
//put null at the end of the array
number++;
LeftArg[number] = NULL;
//parse the spaces out of the right side
number = 0;
RightArg[0] = strtok(command[1], " ");
//split the arguments
while((ptr=strtok(NULL, " ")) != NULL)
{
number++;
RightArg[number]=ptr;
}
//put null at the end of the array
number++;
RightArg[number] = NULL;
Now you can use LeftArg and RightArg in the command, after you get the piping right
execvp(LeftArg[0], LeftArg);//execute left side of the command
Then pipe to the right side of the command and do
execvp(RightArg[0], RightArg);//execute right side of command

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

How to split a string by emojis in C++ - c++

Related

Generate string by switching ASCII code (extended)

How to get the name of a Unicode character?

Lexicographical sorting for non-ascii characters

How to remove the last character of a UTF-8 string in C++?

Splitting a char array by delimeter, then saving the result?

Categories

Resources