I'm trying to understand how to handle basic UTF-8 operations in C++.
Let's say we have this scenario: the user inputs a name, it's limited to 10 letters (symbols in the user's language, not bytes), and it's stored.
In ASCII it can be done this way:
// ASCII
char *input;   // user's input
char buf[11];  // 10 letters + terminating zero
snprintf(buf, 11, "%s", input);
int len = strlen(buf); // returns 10 (correct)
Now, how to do it in UTF-8? Let's assume a charset of up to 4 bytes per character (like Chinese).
// UTF-8
char *input;   // user's input
char buf[41];  // 10 letters * 4 bytes + terminating zero
snprintf(buf, 41, "%s", input); // ?? makes no sense: it limits by number of bytes, not letters
int len = strlen(buf); // returns the number of bytes, not letters (incorrect)
Can it be done with standard snprintf/strlen? Are there any replacements for those functions for use with UTF-8 (PHP had an mb_ prefix for such functions, IIRC)? If not, do I need to write them myself? Or should I approach it another way?
Note: I would prefer to avoid wide characters solution...
EDIT: Let's limit it to Basic Multilingual Plane only.
I would prefer to avoid wide characters solution...
Wide characters are just not enough: if you need 4 bytes for a single glyph, then that glyph is likely to be outside the Basic Multilingual Plane, and it will not be represented by a single 16-bit wchar_t (assuming wchar_t is 16 bits wide, which is merely the common size).
You will have to use a true Unicode library to convert the input to a list of Unicode characters in their Normal Form C (canonical composition) or the compatibility equivalent (NFKC)(*), depending on whether, for example, you want to count one or two characters for the ligature ﬀ (U+FB00). AFAIK, your best bet is ICU.
(*) Unicode allows multiple representations of the same glyph, notably the normal composed form (NFC) and the normal decomposed form (NFD). For example, the French é character can be represented in NFC as U+00E9 (LATIN SMALL LETTER E WITH ACUTE) or in NFD as U+0065 U+0301 (LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT, also displayed as é).
References and other examples on Unicode equivalence
strlen only counts the bytes in the input string, until the terminating NUL.
On the other hand, you seem interested in the glyph count (what you called "symbols in user's language").
The process is complicated by UTF-8 being a variable-length encoding (as is, to a lesser extent, UTF-16), so code points can be encoded using one to four bytes. And there are also Unicode combining characters to consider.
To my knowledge, there's nothing like that in the standard C++ library. However, you may have better luck using third party libraries like ICU.
std::strlen indeed considers only one-byte characters. To compute the length of a NUL-terminated wide string, one can use std::wcslen instead.
Example:
#include <iostream>
#include <cwchar>
#include <clocale>
int main()
{
const wchar_t* str = L"爆ぜろリアル!弾けろシナプス!パニッシュメントディス、ワールド!";
std::setlocale(LC_ALL, "en_US.utf8");
std::wcout.imbue(std::locale("en_US.utf8"));
std::wcout << "The length of \"" << str << "\" is " << std::wcslen(str) << '\n';
}
If you do not want to count UTF-8 characters yourself, you can use a temporary conversion to wide characters to cut your input string. You do not need to store the intermediate values:
#include <iostream>
#include <codecvt>
#include <string>
#include <locale>
// Note: std::codecvt_utf8 and std::wstring_convert are deprecated since C++17,
// but remain usable for BMP-only text.
std::string cutString(const std::string& in, size_t len)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> cvt;
    auto wstr = cvt.from_bytes(in);
    if(len < wstr.length())
    {
        wstr = wstr.substr(0, len);
        return cvt.to_bytes(wstr);
    }
    return in;
}
int main(){
std::string test = "你好世界這是演示樣本";
std::string res = cutString(test,5);
std::cout << test << '\n' << res << '\n';
return 0;
}
/****************
Output
$ ./test
你好世界這是演示樣本
你好世界這
*/
Related
For example, I want to create a typewriter effect, so I need to print strings like this:
#include <cstdio>
#include <string>
int main(){
    std::string st1 = "ab》cd《ef";
    for(std::size_t i = 0; i < st1.size(); i++){
        std::string st2 = st1.substr(0, i);
        printf("%s\n", st2.c_str());
    }
    return 0;
}
but the output is
a
ab
ab?
ab?
ab》
ab》c
ab》cd
ab》cd?
ab》cd?
ab》cd《
ab》cd《e
and not:
a
ab
ab》
ab》c
ab》cd
ab》cd《
ab》cd《e
How can I tell whether the upcoming character starts a multi-byte Unicode sequence?
A similar question: printing each character one by one has the same problem:
#include <cstdio>
#include <string>
int main(){
    std::string st1 = "ab》cd《ef";
    for(std::size_t i = 0; i < st1.size(); i++){
        std::string st2 = st1.substr(i, 1);
        printf("%s\n", st2.c_str());
    }
    return 0;
}
the output is:
a
b
?
?
?
c
d
?
?
?
e
f
not:
a
b
》
c
d
《
e
f
I think the problem is encoding. Your string is likely in UTF-8, which has variable-sized characters. This means you cannot iterate one char at a time, because some characters are more than one char wide.
The fact is, in Unicode, you can only reliably iterate one fixed-size character at a time with the UTF-32 encoding.
So what you can do is use a UTF library like ICU to convert between UTF-8 and UTF-32.
If you have C++11 then there are some tools to help you here, mostly std::u32string which is able to hold UTF-32 encoded strings:
#include <string>
#include <iostream>
#include <unicode/ucnv.h>
#include <unicode/uchar.h>
#include <unicode/utypes.h>
// convert from UTF-32 to UTF-8
std::string to_utf8(std::u32string s)
{
UErrorCode status = U_ZERO_ERROR;
char target[1024];
int32_t len = ucnv_convert(
"UTF-8", "UTF-32"
, target, sizeof(target)
, (const char*)s.data(), s.size() * sizeof(char32_t)
, &status);
return std::string(target, len);
}
// convert from UTF-8 to UTF-32
std::u32string to_utf32(const std::string& utf8)
{
UErrorCode status = U_ZERO_ERROR;
char32_t target[256];
int32_t len = ucnv_convert(
"UTF-32", "UTF-8"
, (char*)target, sizeof(target)
, utf8.data(), utf8.size()
, &status);
return std::u32string(target, (len / sizeof(char32_t)));
}
int main()
{
// UTF-8 input (needs UTF-8 editor)
std::string utf8 = "ab》cd《ef"; // UTF-8
// convert to UTF-32
std::u32string utf32 = to_utf32(utf8);
// Now it is safe to use string indexing
// But i is for length so starting from 1
for(std::size_t i = 1; i < utf32.size(); ++i)
{
// convert back to UTF-8 for output
// NOTE: i + 1 to include the BOM
std::cout << to_utf8(utf32.substr(0, i + 1)) << '\n';
}
}
Output:
a
ab
ab》
ab》c
ab》cd
ab》cd《
ab》cd《e
ab》cd《ef
NOTE:
The ICU library adds a BOM (Byte Order Mark) at the beginning of the strings it converts into Unicode. Therefore you need to deal with the fact that the first character of the UTF-32 string is the BOM. This is why the substring uses i + 1 for its length parameter to include the BOM.
Your C++ code is simply echoing octets to your terminal, and it is your terminal display that's converting octets encoded in its default character set to Unicode characters.
It looks like, based on your example, that your terminal display uses UTF-8. The rules for converting UTF-8-encoded characters to Unicode are fairly well specified (Google is your friend), so all you have to do is check the first byte of a UTF-8 sequence to figure out how many octets make up the next Unicode character.
I need some clarifications.
The problem is that I have a program for Windows written in C++ which uses the Windows-specific wmain function that accepts wchar_t** as its args. So there is an opportunity to pass whatever you like as command-line parameters to such a program: for example, Chinese symbols, Japanese ones, etc.
To be honest, I have no information about the encoding this function is usually used with. Probably UTF-32, or even UTF-16.
So, the questions:
What is the non-Windows-specific, Unix/Linux way to achieve this with the standard main function? My first thought was to use UTF-8-encoded input strings with some kind of locale specification.
Can somebody give a simple example of such a main function? How can a std::string hold Chinese symbols?
Can we operate on Chinese symbols encoded in UTF-8 and contained in std::strings as usual, just accessing each char (byte) like this: string_object[i]?
Disclaimer: all Chinese words provided by Google Translate.
1) Just proceed as normal using normal std::string. The std::string can hold any character encoding and argument processing is simple pattern matching. So on a Chinese computer with the Chinese version of the program installed all it needs to do is compare Chinese versions of the flags to what the user inputs.
2) For example:
#include <string>
#include <vector>
#include <iostream>
std::string arg_switch = "开关";
std::string arg_option = "选项";
std::string arg_option_error = "缺少参数选项";
int main(int argc, char* argv[])
{
const std::vector<std::string> args(argv + 1, argv + argc);
bool do_switch = false;
std::string option;
for(auto arg = args.begin(); arg != args.end(); ++arg)
{
if(*arg == "--" + arg_switch)
do_switch = true;
else if(*arg == "--" + arg_option)
{
if(++arg == args.end())
{
// option needs a value - not found
std::cout << arg_option_error << '\n';
return 1;
}
option = *arg;
}
}
std::cout << arg_switch << ": " << (do_switch ? "on":"off") << '\n';
std::cout << arg_option << ": " << option << '\n';
return 0;
}
Usage:
./program --开关 --选项 wibble
Output:
开关: on
选项: wibble
3) No.
For UTF-8/UTF-16 data we need to use special libraries like ICU.
For character-by-character processing you need to use, or convert to, UTF-32.
In short:
#include <clocale>

int main(int argc, char **argv) {
    setlocale(LC_CTYPE, "");
    // ...
}
http://unixhelp.ed.ac.uk/CGI/man-cgi?setlocale+3
And then you use multibyte string functions. You can still use a normal std::string for storing multibyte strings, but beware that characters in them may span multiple array cells. After successfully setting the locale, you can also use wide streams (wcin, wcout, wcerr) to read and write wide strings from the standard streams.
1) With Linux, you get the standard main() and standard char. It uses the UTF-8 encoding, so Chinese-specific characters are included in the string with a multibyte encoding.
*Edit: sorry, yes: you have to set the default "" locale like here, as well as cout.imbue().*
2) All the classic main() examples would be good examples. As said, Chinese-specific characters are included in the string with a multibyte encoding. So if you cout such a string with the default UTF-8 locale, the cout stream interprets the special UTF-8 sequences, knowing it has to aggregate between 2 and 4 bytes per character in order to produce the Chinese output.
3) You can operate on strings as usual. There are some issues, however, if you cout the string length, for example: there is a difference between memory (e.g. 3 bytes) and the characters the user sees (e.g. only 1). The same applies if you move a pointer forward or backward. You have to make sure you interpret the multibyte encoding correctly, in order not to output an invalid sequence.
You could be interested in this other SO question.
Wikipedia explains the logic of the UTF-8 multibyte encoding. From this article you'll understand that any char u starts a multibyte-encoded character if:
( ((u & 0xE0) == 0xC0)
|| ((u & 0xF0) == 0xE0)
|| ((u & 0xF8) == 0xF0)
|| ((u & 0xFC) == 0xF8)
|| ((u & 0xFE) == 0xFC) )
It is followed by one or several chars such that:
((u & 0xC0) == 0x80)
All other chars are ASCII chars (i.e. not multibyte). Note that the last two patterns (0xF8, 0xFC) correspond to the original 5- and 6-byte forms, which RFC 3629 has since removed: modern UTF-8 sequences are at most 4 bytes long.
I am writing a program and I need a function that returns the number of characters and spaces in a string. I have a string (mystring) that the user writes; I want the function to return the exact number of letters and spaces in the string. For example, "Hello World" should return 11, since there are 10 letters and 1 space. I know string::size exists, but this returns the size in bytes, which is of no use to me.
I'm not sure if you want the length of the string in characters or you just want to count the number of letters and spaces.
There is no specific function that lets you count just letters and spaces, however you can get the amount of letters and spaces (and ignore all other types of characters) quite simply:
#include <string>
#include <algorithm>
#include <cctype>
#include <iostream>

int main() {
    std::string mystring = "Hello 123 World";
    // Cast to unsigned char: passing a negative char to isspace/isalpha is undefined behavior.
    auto l = std::count_if(mystring.begin(), mystring.end(),
                           [](unsigned char c){ return std::isspace(c) || std::isalpha(c); });
    std::cout << l << '\n';
    return 0;
}
Otherwise, unless you use non-ASCII strings, std::string::length should work for you.
In general, it's not so simple and you're quite right if you assumed that one byte doesn't necessarily mean one character. However, if you're just learning, you don't have to deal with unicode and the accompanying nastiness yet. For now you can assume 1 byte is 1 character, just know that it's not generally true.
Your first aim should be to figure out whether the string is ASCII-encoded or encoded in a multi-byte format.
For ASCII, string::size would suffice; you could use the length member of string as well.
In the latter case, you need to find the number of bytes per character.
You could take the size of your string, in bytes, using string::size:
int len = mystring.size();
Note that dividing by sizeof(char) changes nothing, since sizeof(char) is 1 by definition, and sizeof is a built-in operator, not a function declared in a header, so no extra include is needed.
You can make your own function to get the length of string in C++ (For std::string)
#include <iostream>
#include <string>
using namespace std;

// Counts bytes up to the terminating NUL, like strlen - it is NOT UTF-8 aware.
int get_len(const string& str){
    int len = 0;
    const char *ptr = str.c_str();
    while(*ptr != '\0')
    {
        ++ptr;
        ++len;
    }
    return len;
}
To use this function, simply use:
get_len("str");
I don't know how to solve that:
Imagine, we have 4 websites:
A: UTF-8
B: ISO-8859-1
C: ASCII
D: UTF-16
My program, written in C++, does the following: it downloads a website and parses it. But it has to understand the content. My problem is not the parsing, which is done with ASCII characters like ">" or "<".
The problem is that the program should find all words out of the website's text. A word is any combination of alphanumerical characters.
Then I send these words to a server. The database and the web-frontend are using UTF-8.
So my questions are:
How can I convert "any" (or the most used) character encoding to UTF-8?
How can I work with UTF-8 strings in C++? I think wchar_t does not work because it is 2 bytes long. Code points in UTF-8 can take up to 4 bytes...
Are there functions like isspace(), isalnum(), strlen(), tolower() for such UTF-8-strings?
Please note: I do not do any output(like std::cout) in C++. Just filtering out the words and send them to the server.
I know about UTF8-CPP, but it has no is*() functions. And as I read, it does not convert from other character encodings to UTF-8, only from UTF-* to UTF-8.
Edit: I forgot to say, that the program has to be portable: Windows, Linux, ...
How can I convert "any" (or the most used) character encoding to UTF-8?
ICU (International Components for Unicode) is the solution here. It is generally considered to be the last say in Unicode support. Even Boost.Locale and Boost.Regex use it when it comes to Unicode. See my comment on Dory Zidon's answer as to why I recommend using ICU directly, instead of wrappers (like Boost).
You create a converter for a given encoding...
#include <unicode/ucnv.h>

UConverter * converter;
UErrorCode err = U_ZERO_ERROR;
converter = ucnv_open( "8859-1", &err );
if ( U_SUCCESS( err ) )
{
    // ...
    ucnv_close( converter );
}
...and then use the UnicodeString class as appropriate.
I think wchar_t does not work because it is 2 bytes long.
The size of wchar_t is implementation-defined. AFAICR, Windows is 2 byte (UCS-2 / UTF-16, depending on Windows version), Linux is 4 byte (UTF-32). In any case, since the standard doesn't define Unicode semantics for wchar_t, using it is non-portable guesswork. Don't guess, use ICU.
Are there functions like isspace(), isalnum(), strlen(), tolower() for such UTF-8-strings?
Not in their UTF-8 encoding, but you don't use that internally anyway. UTF-8 is good for external representation, but internally UTF-16 or UTF-32 are the better choice. The abovementioned functions do exist for Unicode code points (i.e., UChar32); ref. uchar.h.
Please note: I do not do any output(like std::cout) in C++. Just filtering out the words and send them to the server.
Check BreakIterator.
Edit: I forgot to say, that the program has to be portable: Windows, Linux, ...
In case I haven't said it already, do use ICU, and save yourself tons of trouble. Even if it might seem a bit heavyweight at first glance, it is the best implementation out there, it is extremely portable (using it on Windows, Linux, and AIX myself), and you will use it again and again and again in projects to come, so time invested in learning its API is not wasted.
Not sure if this will give you everything you're looking for, but it might help a little.
Have you tried looking at:
1) Boost.Locale library ?
Boost.Locale was released in Boost 1.48 (November 15th, 2011), making it easier to convert to and from UTF-8/16.
Here are some convenient examples from the docs:
string utf8_string = to_utf<char>(latin1_string,"Latin1");
wstring wide_string = to_utf<wchar_t>(latin1_string,"Latin1");
string latin1_string = from_utf(wide_string,"Latin1");
string utf8_string2 = utf_to_utf<char>(wide_string);
2) Or at
the conversions that are part of C++11?
#include <codecvt>
#include <locale>
#include <string>
#include <cassert>
int main() {
std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
std::string utf8 = convert.to_bytes(0x5e9);
assert(utf8.length() == 2);
assert(utf8[0] == '\xD7');
assert(utf8[1] == '\xA9');
}
How can I work with UTF-8-strings in C++? I think wchar_t does not work because it is 2 bytes long. Code-Points in UTF-8 are up to 4 bytes long...
This is easy: there is a project named tinyutf8, which is a drop-in replacement for std::string/std::wstring.
The user can then operate elegantly on code points, while their representation is always encoded in chars.
How can I convert "any" (or the most used) character encoding to UTF-8?
You might want to have a look at std::codecvt_utf8 and simlilar templates from <codecvt> (C++11).
UTF-8 is an encoding that uses multiple bytes for non-ASCII characters (ASCII being a 7-bit code), utilising the 8th bit. As such, you won't find '\' or '/' inside a multi-byte sequence, and isdigit works (though not for Arabic and other digits).
It is a superset of ASCII and can hold all Unicode characters, so it is definitely usable with char and string.
Inspect the HTTP headers (case insensitive); they are in ISO-8859-1, and precede an empty line and then the HTML content.
Content-Type: text/html; charset=UTF-8
If not present, there also there might be
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta charset="UTF-8"> <!-- HTML5 -->
ISO-8859-1 is Latin-1, and you might do better to convert from Windows-1252, the Windows Latin-1 superset, which uses 0x80 - 0x9F for some special characters like curly quotes and dashes.
Even browsers on macOS will understand these, even though ISO-8859-1 was specified.
Conversion libraries: already mentioned by @syam.
Conversion
Let's not consider UTF-16. One can read the headers, and the content up to the meta statement giving the charset, as single-byte chars.
The conversion from a single-byte encoding to UTF-8 can happen via a table, for instance generated with Java: a const char* table[] indexed by the byte value.
table[157] = "\xEF\xBF\xBD";
public static void main(String[] args) {
final String SOURCE_ENCODING = "windows-1252";
byte[] sourceBytes = new byte[1];
System.out.println(" const char* table[] = {");
for (int c = 0; c < 256; ++c) {
String comment = "";
System.out.printf(" /* %3d */ \"", c);
if (32 <= c && c < 127) {
// Pure ASCII
if (c == '\"' || c == '\\')
System.out.print("\\");
System.out.print((char)c);
} else {
if (c == 0) {
comment = " // Unusable";
}
sourceBytes[0] = (byte)c;
try {
byte[] targetBytes = new String(sourceBytes, SOURCE_ENCODING).getBytes("UTF-8");
for (int j = 0; j < targetBytes.length; ++j) {
int b = targetBytes[j] & 0xFF;
System.out.printf("\\x%02X", b);
}
} catch (UnsupportedEncodingException ex) {
comment = " // " + ex.getMessage().replaceAll("\\s+", " "); // No newlines.
}
}
System.out.print("\"");
if (c < 255) {
System.out.print(",");
}
System.out.println();
}
System.out.println(" };");
}
I am reading a string of data, which may or may not contain Unicode (UTF-8) characters, from an Oracle database into a C++ program. Is there any way of checking whether the string extracted from the database contains Unicode characters? If any are present, they should be converted into hexadecimal format and displayed.
There are two aspects to this question.
Distinguish UTF-8-encoded characters from ordinary ASCII characters.
UTF-8 encodes any code point higher than 127 as a series of two or more bytes. Values at 127 and lower remain untouched. The resultant bytes from the encoding are also higher than 127, so it is sufficient to check a byte's high bit to see whether it qualifies.
Display the encoded characters in hexadecimal.
C++ has std::hex to tell streams to format numeric values in hexadecimal. You can use std::showbase to make the output look pretty. A char isn't treated as numeric, though; streams will just print the character. You'll have to force the value to another numeric type, such as int. Beware of sign-extension, though.
Here's some code to demonstrate:
#include <iostream>
void print_characters(char const* s)
{
std::cout << std::showbase << std::hex;
for (char const* pc = s; *pc; ++pc) {
if (*pc & 0x80)
std::cout << (*pc & 0xff);
else
std::cout << *pc;
std::cout << ' ';
}
std::cout << std::endl;
}
You could call it like this:
int main()
{
char const* test = "ab\xef\xbb\xbfhu";
print_characters(test);
return 0;
}
Output on Solaris 10 with Sun C++ 5.8:
$ ./a.out
a b 0xef 0xbb 0xbf h u
The code detects UTF-8-encoded characters, but it makes no effort to decode them; you didn't mention needing to do that.
I used *pc & 0xff to convert the expression to an integral type and to mask out the sign-extended bits. Without that, the output on my computer was 0xffffffbb, for instance.
I would convert the string to UTF-32 (you can use something like UTF CPP for that - it is very easy), and then loop through the resulting string, detect code points (characters) that are above 0x7F and print them as hex.