This is my code:
std::vector<std::string> InverseIndex::getWords(std::string line)
{
    std::vector<std::string> words;
    char* str = (char*)line.c_str();
    char* end = str + strlen(str) + 1;
    unsigned char symbol[5] = {0, 0, 0, 0, 0};
    while (str < end) {
        utf8::uint32_t code = utf8::next(str, end);
        if (code == 0) continue;
        utf8::append(code, symbol);
        // TODO detect white spaces or numbers.
        std::string word = (const char*)symbol;
        words.push_back(word);
    }
    return words;
}
Input: "你 好 啊 哈哈 1234"
Output:
你
??
好
??
啊
??
哈
哈
??
1??
2??
3??
4??
Expected output:
你
好
啊
哈
哈
Is there any way to skip the whitespace or the numbers? Thanks.
UTF8-CPP is nothing more than a tool for encoding and decoding strings into/out of UTF-8. Classification of Unicode code points is well outside the scope of that tool. You'll need a serious localization library like Boost.Locale or ICU for that.
UTF-8 is "ASCII compatible" in the following sense:
If one of the bytes of the encoded string is equal to an ASCII value (such as a space, a newline, or the digits 0-9), then it is not part of an encoded sequence longer than one byte; it really is that very character.
This means that you can call isdigit() on a byte of a UTF-8 string as if it were an ASCII string, and it is guaranteed to work correctly.
For more information, see the section on search at http://utf8everywhere.org.
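Putting both answers together, here is a minimal sketch of the loop, assuming the same utf8cpp calls as in the question. It skips code points below 128 that are spaces or digits, and it also resets the symbol buffer on every iteration, since stale bytes left over from a longer previous character are what produce the ?? entries in the output:

#include <cctype>
#include <string>
#include <vector>
#include "utf8.h" // utf8cpp header, as in the question

std::vector<std::string> getWords(const std::string& line)
{
    std::vector<std::string> words;
    const char* str = line.c_str();
    const char* end = str + line.size();   // no +1: do not decode the terminator
    while (str < end) {
        utf8::uint32_t code = utf8::next(str, end);
        // Code points below 128 are plain ASCII, so isspace()/isdigit() are safe on them.
        if (code < 128 && (std::isspace((int)code) || std::isdigit((int)code)))
            continue;
        unsigned char symbol[5] = {0, 0, 0, 0, 0}; // reset each time so a shorter
                                                   // sequence does not keep old bytes
        utf8::append(code, symbol);
        words.push_back(std::string((const char*)symbol));
    }
    return words;
}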
I have a list of strings that were "encrypted" with a simple character-code shift. To make them readable I need to increase each character code by 7. The problem is that those codes can be above 127 (extended ASCII). I'm iterating over each seed string and trying to produce a new string like this:
std::string result = "";
for (std::string::iterator it = myEncryptedWord.begin(); it != myEncryptedWord.end(); ++it)
{
    unsigned char myC = *it;
    result += char(myC + 7);
}
It does not work for codes above 127 (characters like áóõã...). Needless to say, I can't control the original string list or change this poor "encryption".
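If the strings are UTF-8, characters like á occupy two bytes, and shifting each byte independently corrupts the encoding. A minimal sketch, assuming the original scheme shifted whole Unicode code points rather than raw bytes, and reusing the utf8cpp calls from the first question (shiftWord is a hypothetical helper name):

#include <iterator>
#include <string>
#include "utf8.h" // utf8cpp, assuming the input strings are UTF-8

std::string shiftWord(const std::string& myEncryptedWord)
{
    std::string result;
    const char* it  = myEncryptedWord.c_str();
    const char* end = it + myEncryptedWord.size();
    while (it != end) {
        utf8::uint32_t code = utf8::next(it, end);           // decode one code point
        utf8::append(code + 7, std::back_inserter(result));  // shift it and re-encode
    }
    return result;
}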
I'm trying to take a string of emojis and split it into a vector with one emoji per element. Given the string:
std::string emojis = "😀🔍🦑😁🔍🎉😂🤣";
I'm trying to get:
std::vector<std::string> splitted_emojis = {"😀", "🔍", "🦑", "😁", "🔍", "🎉", "😂", "🤣"};
Edit
I've tried to do:
std::string emojis = "😀🔍🦑😁🔍🎉😂🤣";
std::vector<std::string> splitted_emojis;
size_t pos = 0;
std::string token;
while ((pos = emojis.find("")) != std::string::npos)
{
    token = emojis.substr(0, pos);
    splitted_emojis.push_back(token);
    emojis.erase(0, pos);
}
But after a couple of seconds it aborts with terminate called after throwing an instance of 'std::bad_alloc'.
When trying to check how many emojis are in a string using:
std::string emojis = "😀🔍🦑😁🔍🎉😂🤣";
std::cout << emojis.size() << std::endl; // returns 32
it returns a bigger number, which I assume reflects the underlying encoded data. I don't know much about Unicode, but I'm trying to figure out where the data of one emoji begins and ends so that I can split the string into individual emojis.
I would definitely recommend that you use a library with better Unicode support (all large frameworks have one), but in a pinch you can get by with knowing that the UTF-8 encoding spreads Unicode characters over multiple bytes, and that the first bits of the first byte determine how many bytes a character is made up of.
I stole a function from Boost. The split_by_codepoint function iterates over the input string, constructs a new string from the next N bytes (where N is determined by the byte-count function), and pushes it onto the ret vector.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Taken from boost internals
inline unsigned utf8_byte_count(uint8_t c)
{
    // if the most significant bit with a zero in it is in position
    // 8-N then there are N bytes in this UTF-8 sequence:
    uint8_t mask = 0x80u;
    unsigned result = 0;
    while (c & mask)
    {
        ++result;
        mask >>= 1;
    }
    return (result == 0) ? 1 : ((result > 4) ? 4 : result);
}

std::vector<std::string> split_by_codepoint(std::string input) {
    std::vector<std::string> ret;
    auto it = input.cbegin();
    while (it != input.cend()) {
        uint8_t count = utf8_byte_count(*it);
        ret.emplace_back(std::string{it, it + count});
        it += count;
    }
    return ret;
}

int main() {
    std::string emojis = u8"😀🔍🦑😁🔍🎉😂🤣";
    auto split = split_by_codepoint(emojis);
    std::cout << split.size() << std::endl;
}
Note that this function simply splits a string into UTF-8 strings containing one code point each. Determining if the character is an emoji is left as an exercise: UTF-8-decode any 4-byte characters and see if they are in the proper range.
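As a rough illustration of that exercise, here is a sketch that decodes one of the 4-byte strings produced by split_by_codepoint and tests it against an approximate pictographic range; the helper name looks_like_emoji and the range U+1F300..U+1FAFF are assumptions, not a complete emoji classification:

#include <cstdint>
#include <string>

bool looks_like_emoji(const std::string& utf8_char)
{
    if (utf8_char.size() != 4)
        return false;  // only the 4-byte sequences are of interest here
    // Decode a 4-byte UTF-8 sequence: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    uint32_t cp = (uint32_t(uint8_t(utf8_char[0]) & 0x07) << 18)
                | (uint32_t(uint8_t(utf8_char[1]) & 0x3F) << 12)
                | (uint32_t(uint8_t(utf8_char[2]) & 0x3F) << 6)
                |  uint32_t(uint8_t(utf8_char[3]) & 0x3F);
    return cp >= 0x1F300 && cp <= 0x1FAFF;  // rough range for pictographic emoji
}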
I have written a parser that, it turns out, works incorrectly with UTF-8 text.
The parser is very simple:
while (pos < end) {
    // find some ASCII char
    if (text.at(pos) == '#') {
        // Check some conditions and if the syntax is wrong...
        if (...)
            createDiagnostic(pos);
    }
    pos++;
}
So you can see I am creating a diagnostic at pos. But that pos is wrong if there were UTF-8 characters before it, because a UTF-8 character may consist of more than one char. How do I correctly skip the UTF-8 characters as if each one were a single character?
I need this because the diagnostics are sent to UTF-8-aware VSCode.
I tried to read some articles on UTF-8 in C++, but every material I found is huge, and I only need to skip the UTF-8 characters.
If the code point is less than 128, then UTF-8 encodes it as ASCII (the highest bit is not set). If the code point is 128 or larger, all of its encoded bytes have the highest bit set. So this will work:
unsigned char b = <...>; // b is a byte from a utf-8 string
if (b & 0x80) {
    // ignore it, as b is part of a >=128 codepoint
} else {
    // use b as an ASCII code
}
Note: if you want to calculate the number of UTF-8 code points in a string, then you have to count the bytes for which either:
!(b&0x80): the byte is an ASCII character, or
(b&0xc0)==0xc0: the byte is the first byte of a multi-byte UTF-8 sequence.
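Applied to the parser above, a minimal sketch of that counting rule could convert the byte offset pos into a code-point index before reporting it; codepointIndex is a hypothetical helper name:

#include <cstddef>
#include <string>

// Count UTF-8 code points in text[0, pos): a byte starts a code point if it is
// plain ASCII (!(b & 0x80)) or the lead byte of a multi-byte sequence
// ((b & 0xC0) == 0xC0); continuation bytes ((b & 0xC0) == 0x80) are skipped.
std::size_t codepointIndex(const std::string& text, std::size_t pos)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < pos && i < text.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(text[i]);
        if (!(b & 0x80) || (b & 0xC0) == 0xC0)
            ++count;
    }
    return count;
}

The diagnostic would then be created with createDiagnostic(codepointIndex(text, pos)) instead of the raw byte offset.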
For example, I want to create a typewriter effect, so I need to print strings like this:
#include <cstdio>
#include <string>

int main() {
    std::string st1 = "ab》cd《ef";
    for (int i = 0; i < st1.size(); i++) {
        std::string st2 = st1.substr(0, i).c_str();
        printf("%s\n", st2.c_str());
    }
    return 0;
}
but the output is:
a
ab
ab?
ab?
ab》
ab》c
ab》cd
ab》cd?
ab》cd?
ab》cd《
ab》cd《e
and not:
a
ab
ab》
ab》c
ab》cd
ab》cd《
ab》cd《e
How can I know whether the upcoming character is a multi-byte Unicode character?
A similar question: printing each character on its own has the same problem:
#include <cstdio>
#include <string>

int main() {
    std::string st1 = "ab》cd《ef";
    for (int i = 0; i < st1.size(); i++) {
        std::string st2 = st1.substr(i, 1).c_str();
        printf("%s\n", st2.c_str());
    }
    return 0;
}
the output is:
a
b
?
?
?
c
d
?
?
?
e
f
and not:
a
b
》
c
d
《
e
f
I think the problem is the encoding. Your string is likely in UTF-8, which has variable-sized characters. This means you cannot iterate one char at a time, because some characters are more than one char wide.
The fact is, in Unicode, you can only reliably iterate one fixed-size character at a time with the UTF-32 encoding.
So what you can do is use a UTF library like ICU to convert between UTF-8 and UTF-32.
If you have C++11 then there are some tools to help you here, mainly std::u32string, which can hold UTF-32 encoded strings:
#include <string>
#include <iostream>

#include <unicode/ucnv.h>
#include <unicode/uchar.h>
#include <unicode/utypes.h>

// convert from UTF-32 to UTF-8
std::string to_utf8(std::u32string s)
{
    UErrorCode status = U_ZERO_ERROR;
    char target[1024];
    int32_t len = ucnv_convert(
        "UTF-8", "UTF-32"
        , target, sizeof(target)
        , (const char*)s.data(), s.size() * sizeof(char32_t)
        , &status);
    return std::string(target, len);
}

// convert from UTF-8 to UTF-32
std::u32string to_utf32(const std::string& utf8)
{
    UErrorCode status = U_ZERO_ERROR;
    char32_t target[256];
    int32_t len = ucnv_convert(
        "UTF-32", "UTF-8"
        , (char*)target, sizeof(target)
        , utf8.data(), utf8.size()
        , &status);
    return std::u32string(target, (len / sizeof(char32_t)));
}

int main()
{
    // UTF-8 input (needs UTF-8 editor)
    std::string utf8 = "ab》cd《ef"; // UTF-8

    // convert to UTF-32
    std::u32string utf32 = to_utf32(utf8);

    // Now it is safe to use string indexing
    // But i is for length so starting from 1
    for (std::size_t i = 1; i < utf32.size(); ++i)
    {
        // convert back to UTF-8 for output
        // NOTE: i + 1 to include the BOM
        std::cout << to_utf8(utf32.substr(0, i + 1)) << '\n';
    }
}
Output:
a
ab
ab》
ab》c
ab》cd
ab》cd《
ab》cd《e
ab》cd《ef
NOTE:
The ICU library adds a BOM (Byte Order Mark) at the beginning of the strings it converts into Unicode. Therefore you need to deal with the fact that the first character of the UTF-32 string is the BOM. This is why the substring uses i + 1 for its length parameter to include the BOM.
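If juggling the offset is awkward, an alternative sketch is to strip the BOM (U+FEFF) right after converting, assuming it is the only prefix ICU adds here, and then index the string normally. It replaces the body of main above:

std::u32string utf32 = to_utf32(utf8);
if (!utf32.empty() && utf32.front() == char32_t(0xFEFF))
    utf32.erase(0, 1);   // drop the byte order mark

for (std::size_t i = 1; i <= utf32.size(); ++i)
    std::cout << to_utf8(utf32.substr(0, i)) << '\n';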
Your C++ code is simply echoing octets to your terminal, and it is your terminal display that is converting octets encoded in its default character set into Unicode characters.
It looks like, based on your example, that your terminal display uses UTF-8. The rules for converting UTF-8-encoded characters to Unicode are fairly well specified (Google is your friend), so all you have to do is check the first byte of a UTF-8 sequence to figure out how many octets make up the next Unicode character.
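For the typewriter case, here is a minimal sketch of that idea with no library at all: it inspects the lead byte to find the sequence length and assumes the literal is valid UTF-8:

#include <cstdio>
#include <string>

int main() {
    std::string st1 = "ab》cd《ef";
    for (std::size_t i = 0; i < st1.size(); ) {
        unsigned char lead = static_cast<unsigned char>(st1[i]);
        std::size_t len = 1;                        // plain ASCII byte
        if      ((lead & 0xF8) == 0xF0) len = 4;    // 11110xxx: 4-byte sequence
        else if ((lead & 0xF0) == 0xE0) len = 3;    // 1110xxxx: 3-byte sequence
        else if ((lead & 0xE0) == 0xC0) len = 2;    // 110xxxxx: 2-byte sequence
        i += len;                                   // never cut a sequence in half
        printf("%s\n", st1.substr(0, i).c_str());   // typewriter effect on whole characters
    }
    return 0;
}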
I have a buffer with UTF8 data. I need to remove the leading and trailing spaces.
Here is the C code which does it (in place) for ASCII buffer:
char *trim(char *s)
{
    while (isspace(*s))
        memmove(s, s + 1, strlen(s));

    while (*s && isspace(s[strlen(s) - 1]))
        s[strlen(s) - 1] = 0;

    return s;
}
How to do the same for UTF8 buffer in C/C++?
P.S.
Thanks for the performance tip regarding strlen(). Back to the UTF-8 specifics: what if I need to remove all spaces altogether, not only at the beginning and the end? Also, I may need to remove all characters with ASCII code < 32. Is there anything specific to the UTF-8 case here, like using mbstowcs()?
Do you want to remove all of the various Unicode spaces too, or just ASCII spaces? In the latter case you don't need to modify the code at all.
In any case, the method you're using that repeatedly calls strlen is extremely inefficient. It turns a simple O(n) operation into at least O(n^2).
Edit: Here's some code for your updated problem, assuming you only want to strip ASCII spaces and control characters:
unsigned char *in, *out;   // in points at the start of the buffer
for (out = in; *in; in++)
    if (*in > 32)
        *out++ = *in;
*out = 0;
strlen() scans to the end of the string, so calling it multiple times, as in your code, is very inefficient.
Try looking for the first non-space and the last non-space and then memmove the substring:
char *trim(char *s)
{
    char *first;
    char *last;

    first = s;
    while (isspace(*first))
        ++first;

    last = first + strlen(first) - 1;
    while (last > first && isspace(*last))
        --last;

    memmove(s, first, last - first + 1);
    s[last - first + 1] = '\0';
    return s;
}
Also remember that the code modifies its argument.