I've just spent some time figuring out the pcre2 interface and think I've got it for the most part. I want to support UTF-32; pcre2 is already built with support, and the code unit width has been set to 32.
The code below is what I've got working with the code unit width set to 8.
How do I change this to work with UTF-32?
#include "gtest/gtest.h"
#include <pcre2.h>
TEST(PCRE2, example) {
//iterate over all matches in a string
PCRE2_SPTR subject = (PCRE2_SPTR) string("this is it").c_str();
PCRE2_SPTR pattern = (PCRE2_SPTR) string("([a-z]+)|\\s").c_str();
int errorcode;
PCRE2_SIZE erroroffset;
pcre2_code *re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, PCRE2_ANCHORED | PCRE2_UTF, &errorcode,
&erroroffset, NULL);
if (re) {
uint32_t groupcount = 0;
pcre2_pattern_info(re, PCRE2_INFO_BACKREFMAX, &groupcount);
pcre2_match_data *match_data = pcre2_match_data_create_from_pattern(re, NULL);
uint32_t options_exec = PCRE2_NOTEMPTY;
PCRE2_SIZE subjectlen = strlen((const char *) subject);
errorcode = pcre2_match(re, subject, subjectlen, 0, options_exec, match_data, NULL);
while (errorcode >= 0) {
PCRE2_UCHAR *result;
PCRE2_SIZE resultlen;
for (int i = 0; i <= groupcount; i++) {
pcre2_substring_get_bynumber(match_data, i, &result, &resultlen);
printf("Matched:%.*s\n", (int) resultlen, (const char *) result);
pcre2_substring_free(result);
}
// Advance through subject
PCRE2_SIZE *ovector = pcre2_get_ovector_pointer(match_data);
errorcode = pcre2_match(re, subject, subjectlen, ovector[1], options_exec, match_data, NULL);
}
pcre2_match_data_free(match_data);
pcre2_code_free(re);
} else {
// Syntax error in the regular expression at erroroffset
PCRE2_UCHAR error[256];
pcre2_get_error_message(errorcode, error, sizeof(error));
printf("PCRE2 compilation failed at offset %d: %s\n", (int) erroroffset, (char *) error);
}
Presumably subject and pattern need to be converted somehow, and result would be of the same type? I couldn't find anything in the pcre2 header to indicate support for that.
And I guess subjectlen would no longer simply be strlen.
Finally, I put this example together after going through some of the docs and the header; is there anything else I should be doing, or anything worth knowing?
I left pcre2 in the end; after evaluating RE2, PCRE2 and ICU, I chose ICU. Its Unicode support (from what I've seen so far) is much more complete than the other two's. It also provides a very clean API and lots of utilities for manipulation. Importantly, like PCRE2, it provides a Perl-style regex engine which works great with Unicode out of the box.
If you have set the code unit width properly, this could be the problem:
(PCRE2_SPTR) string("this is it").c_str();
Converting the c_str() to PCRE2_SPTR doesn't make the string UTF-32.
If you are unsure about setting the proper code unit width (I didn't see it in your source code), you can force 32-bit by adding the _32 suffix to everything, e.g. pcre2_compile_32.
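For example, here is an untested sketch of what the explicit _32 interface looks like, assuming your library was actually built with 32-bit and UTF support and that char32_t is 32 bits on your target (the pcre2 names and flags are standard, the rest is illustrative):

#define PCRE2_CODE_UNIT_WIDTH 0   // expose all widths; call the _32 names explicitly
#include <pcre2.h>

#include <cstdio>
#include <string>

int main() {
    // Keep the data alive in named u32string objects (a temporary's c_str()
    // would dangle, as in the original snippet).
    std::u32string subject = U"this is it";
    std::u32string pattern = U"([a-z]+)|\\s";

    int errorcode;
    PCRE2_SIZE erroroffset;
    pcre2_code_32 *re = pcre2_compile_32((PCRE2_SPTR32) pattern.c_str(), pattern.length(),
                                         PCRE2_UTF, &errorcode, &erroroffset, NULL);
    if (!re) return 1;

    pcre2_match_data_32 *match_data = pcre2_match_data_create_from_pattern_32(re, NULL);
    // The subject length is a count of 32-bit code units, not bytes (no strlen here).
    int rc = pcre2_match_32(re, (PCRE2_SPTR32) subject.c_str(), subject.length(),
                            0, PCRE2_NOTEMPTY, match_data, NULL);
    if (rc >= 0) {
        PCRE2_SIZE *ovector = pcre2_get_ovector_pointer_32(match_data);
        std::printf("matched code units [%u, %u)\n", (unsigned) ovector[0], (unsigned) ovector[1]);
    }
    pcre2_match_data_free_32(match_data);
    pcre2_code_free_32(re);
    return 0;
}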
It depends on which character type you are going to use and which system you are going to target.
The basic unit of std::string is char, which is generally 8-bit and supports UTF-8 (this can differ depending on implementation/system). So you can't use std::string("some string") and similar code when dealing with UTF-32 on such systems.
PCRE2_CODE_UNIT_WIDTH must match the bit size of the basic character unit you are going to use. For an 8-bit char it should be defined as 8, for a 16-bit char it should be defined as 16, etc.
In GNU/Linux, you can use wchar_t, i.e. std::wstring, which is 32-bit and supports UTF-32. On Windows, wchar_t is 16-bit (with UTF-16).
In C++11 and later, you can use char32_t, i.e. std::u32string, which is at least 32-bit (you will have to make sure it's exactly 32-bit on your target system).
I have a wrapper for PCRE2 in C++ which contains some examples (in src directory) on how to handle UTF-16 and UTF-32 modes.
Related
There seems to be a problem when I'm writing words with foreign characters (French...).
For example, if I ask for input for an std::string or a char[] like this:
std::string s;
std::cin>>s; //if we input the string "café"
std::cout<<s<<std::endl; //outputs "café"
Everything is fine.
However, if the string is hard-coded:
std::string s="café";
std::cout<<s<<std::endl; //outputs "cafÚ"
What is going on? What characters are supported by C++ and how do I make it work right? Does it have something to do with my operating system (Windows 10)? My IDE (VS 15)? Or with C++?
In a nutshell, if you want to pass/receive unicode text to/from the console on Windows 10 (in fact any version of Windows), you need to use wide strings, IE, std::wstring. Windows itself doesn't support UTF-8 encoding. This is a fundamental OS limitation.
The entire Win32 API, on which things like console and file system access are based, only works with unicode characters under the UTF-16 encoding, and the C/C++ runtimes provided in Visual Studio don't offer any kind of translation layer to make this API UTF-8 compatible. This doesn't mean you can't use UTF-8 encoding internally, it just means that when you hit the Win32 API, or a C/C++ runtime feature that uses it, you'll need to convert between UTF-8 and UTF-16 encoding. It sucks, but it's just where we are right now.
Some people might direct you to a series of tricks that purport to make the console work with UTF-8. Don't go this route; you'll run into a lot of problems. Only wide-character strings are properly supported for Unicode console access.
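For what it's worth, a minimal sketch of the wide-string console route looks like this (MSVC-specific; _setmode/_O_U16TEXT put stdout into UTF-16 mode, and from that point only wide output should be used):

#include <fcntl.h>
#include <io.h>
#include <cstdio>
#include <iostream>

int main()
{
    // Switch stdout to UTF-16 mode so wide strings reach the Windows console intact.
    _setmode(_fileno(stdout), _O_U16TEXT);
    std::wcout << L"café" << std::endl;
    return 0;
}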
Edit: Because UTF-8/UTF-16 string conversion is non-trivial, and there also isn't much help provided for this in C++, here are some conversion functions I prepared earlier:
///////////////////////////////////////////////////////////////////////////////////////////////////
std::wstring UTF8ToUTF16(const std::string& stringUTF8)
{
    // Convert the encoding of the supplied string
    std::wstring stringUTF16;
    size_t sourceStringPos = 0;
    size_t sourceStringSize = stringUTF8.size();
    stringUTF16.reserve(sourceStringSize);
    while (sourceStringPos < sourceStringSize)
    {
        // Determine the number of code units required for the next character
        static const unsigned int codeUnitCountLookup[] = { 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 4 };
        unsigned int codeUnitCount = codeUnitCountLookup[(unsigned char)stringUTF8[sourceStringPos] >> 4];

        // Ensure that the requested number of code units are left in the source string
        if ((sourceStringPos + codeUnitCount) > sourceStringSize)
        {
            break;
        }

        // Convert the encoding of this character
        switch (codeUnitCount)
        {
        case 1:
        {
            stringUTF16.push_back((wchar_t)stringUTF8[sourceStringPos]);
            break;
        }
        case 2:
        {
            unsigned int unicodeCodePoint = (((unsigned int)stringUTF8[sourceStringPos] & 0x1F) << 6) |
                                            ((unsigned int)stringUTF8[sourceStringPos + 1] & 0x3F);
            stringUTF16.push_back((wchar_t)unicodeCodePoint);
            break;
        }
        case 3:
        {
            unsigned int unicodeCodePoint = (((unsigned int)stringUTF8[sourceStringPos] & 0x0F) << 12) |
                                            (((unsigned int)stringUTF8[sourceStringPos + 1] & 0x3F) << 6) |
                                            ((unsigned int)stringUTF8[sourceStringPos + 2] & 0x3F);
            stringUTF16.push_back((wchar_t)unicodeCodePoint);
            break;
        }
        case 4:
        {
            unsigned int unicodeCodePoint = (((unsigned int)stringUTF8[sourceStringPos] & 0x07) << 18) |
                                            (((unsigned int)stringUTF8[sourceStringPos + 1] & 0x3F) << 12) |
                                            (((unsigned int)stringUTF8[sourceStringPos + 2] & 0x3F) << 6) |
                                            ((unsigned int)stringUTF8[sourceStringPos + 3] & 0x3F);
            wchar_t convertedCodeUnit1 = 0xD800 | (((unicodeCodePoint - 0x10000) >> 10) & 0x03FF);
            wchar_t convertedCodeUnit2 = 0xDC00 | ((unicodeCodePoint - 0x10000) & 0x03FF);
            stringUTF16.push_back(convertedCodeUnit1);
            stringUTF16.push_back(convertedCodeUnit2);
            break;
        }
        }

        // Advance past the converted code units
        sourceStringPos += codeUnitCount;
    }

    // Return the converted string to the caller
    return stringUTF16;
}
///////////////////////////////////////////////////////////////////////////////////////////////////
std::string UTF16ToUTF8(const std::wstring& stringUTF16)
{
    // Convert the encoding of the supplied string
    std::string stringUTF8;
    size_t sourceStringPos = 0;
    size_t sourceStringSize = stringUTF16.size();
    stringUTF8.reserve(sourceStringSize * 2);
    while (sourceStringPos < sourceStringSize)
    {
        // Check if a surrogate pair is used for this character
        bool usesSurrogatePair = (((unsigned int)stringUTF16[sourceStringPos] & 0xF800) == 0xD800);

        // Ensure that the requested number of code units are left in the source string
        if (usesSurrogatePair && ((sourceStringPos + 2) > sourceStringSize))
        {
            break;
        }

        // Decode the character from UTF-16 encoding
        unsigned int unicodeCodePoint;
        if (usesSurrogatePair)
        {
            unicodeCodePoint = 0x10000 + ((((unsigned int)stringUTF16[sourceStringPos] & 0x03FF) << 10) | ((unsigned int)stringUTF16[sourceStringPos + 1] & 0x03FF));
        }
        else
        {
            unicodeCodePoint = (unsigned int)stringUTF16[sourceStringPos];
        }

        // Encode the character into UTF-8 encoding
        if (unicodeCodePoint <= 0x7F)
        {
            stringUTF8.push_back((char)unicodeCodePoint);
        }
        else if (unicodeCodePoint <= 0x07FF)
        {
            char convertedCodeUnit1 = (char)(0xC0 | (unicodeCodePoint >> 6));
            char convertedCodeUnit2 = (char)(0x80 | (unicodeCodePoint & 0x3F));
            stringUTF8.push_back(convertedCodeUnit1);
            stringUTF8.push_back(convertedCodeUnit2);
        }
        else if (unicodeCodePoint <= 0xFFFF)
        {
            char convertedCodeUnit1 = (char)(0xE0 | (unicodeCodePoint >> 12));
            char convertedCodeUnit2 = (char)(0x80 | ((unicodeCodePoint >> 6) & 0x3F));
            char convertedCodeUnit3 = (char)(0x80 | (unicodeCodePoint & 0x3F));
            stringUTF8.push_back(convertedCodeUnit1);
            stringUTF8.push_back(convertedCodeUnit2);
            stringUTF8.push_back(convertedCodeUnit3);
        }
        else
        {
            char convertedCodeUnit1 = (char)(0xF0 | (unicodeCodePoint >> 18));
            char convertedCodeUnit2 = (char)(0x80 | ((unicodeCodePoint >> 12) & 0x3F));
            char convertedCodeUnit3 = (char)(0x80 | ((unicodeCodePoint >> 6) & 0x3F));
            char convertedCodeUnit4 = (char)(0x80 | (unicodeCodePoint & 0x3F));
            stringUTF8.push_back(convertedCodeUnit1);
            stringUTF8.push_back(convertedCodeUnit2);
            stringUTF8.push_back(convertedCodeUnit3);
            stringUTF8.push_back(convertedCodeUnit4);
        }

        // Advance past the converted code units
        sourceStringPos += (usesSurrogatePair) ? 2 : 1;
    }

    // Return the converted string to the caller
    return stringUTF8;
}
I was in charge of the unenviable task of converting a 6 million line legacy Windows app to support Unicode, when it was only written to support ASCII (in fact its development pre-dates Unicode), where we used std::string and char[] internally to store strings. Since changing all the internal string storage buffers was simply not possible, we needed to adopt UTF-8 internally and convert between UTF-8 and UTF-16 when hitting the Win32 API. These are the conversion functions we used.
I would strongly recommend sticking with what's supported for new Windows development, which means wide strings. That said, there's no reason you can't base the core of your program on UTF-8 strings, but it will make things more tricky when interacting with Windows and various aspects of the C/C++ runtimes.
Edit 2: I've just re-read the original question, and I can see I didn't answer it very well. Let me give some more info that will specifically answer your question.
What's going on? When developing with C++ on Windows and you use std::string with std::cin/std::cout, the console IO is done using MBCS encoding. This is a deprecated mode under which the characters are encoded using the currently selected code page on the machine. Values encoded under these code pages are not Unicode, and cannot be shared with other systems that have a different code page selected, or even the same system if the code page is changed. It works perfectly in your test, because you're capturing the input under the current code page and displaying it back under the same code page.
If you try capturing that input and saving it to a file, inspection will show it's not Unicode. Load it back with a different code page selected in your OS, and the text will appear corrupted. You can only interpret text if you know what code page it was encoded in. Since these legacy code pages are regional, and none of them can represent all text characters, it is effectively impossible to share text universally across different machines and computers.
MBCS pre-dates the development of Unicode, and it was specifically because of these kinds of issues that Unicode was invented. Unicode is basically the "one code page to rule them all". You might be wondering why UTF-8 isn't a selectable "legacy" code page on Windows. A lot of us are wondering the same thing. Suffice to say, it isn't. As such, you shouldn't rely on MBCS encoding, because you can't get Unicode support when using it. Your only option for Unicode support on Windows is using std::wstring and calling the UTF-16 Win32 APIs.
As for your example with the hard-coded string, first of all understand that encoding non-ASCII text in your source file puts you in the realm of compiler-specific behaviour. In Visual Studio, you can actually specify the encoding of the source file (under File->Advanced Save Options). In your case, the text is coming out different from what you'd expect because it's being encoded (most likely) in UTF-8, but as mentioned, the console output is being done using MBCS encoding with your currently selected code page, which isn't UTF-8. Historically, you would have been advised to avoid any non-ASCII characters in source files and escape them using the \x notation. Today, there are C++11 string literal prefixes that guarantee various encoding forms. You could try using these if you need this ability. I have no practical experience using them, so I can't advise if there are any issues with this approach.
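For reference, the literal prefixes in question look like this (a sketch only; how the resulting bytes show up on the console still depends on the code page in effect):

// C++11 string literal prefixes; the encodings are guaranteed by the standard.
// (In C++20 the u8 literal yields const char8_t* rather than const char*.)
const char*     a = u8"café";   // UTF-8
const char16_t* b = u"café";    // UTF-16
const char32_t* c = U"café";    // UTF-32
const wchar_t*  d = L"café";    // implementation-defined wide encoding (UTF-16 on Windows)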
The problem originates with Windows itself. It uses one character encoding (UTF-16) for most internal operations, another (Windows-1252) for default file encoding, and yet another (Code Page 850 in your case) for console I/O. Your source file is encoded in Windows-1252, where é equates to the single byte '\xe9'. When you display this same code in Code Page 850, it becomes Ú. Using u8"é" produces a two byte sequence "\xc3\xa9", which prints on the console as ├®.
Probably the easiest solution is to avoid putting non-ASCII literals in your code altogether and use the hex code for the character you require. This won't be a pretty or portable solution though.
std::string s="caf\x82";
A better solution would be to use wide (UTF-16) strings and encode them using WideCharToMultiByte.
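Something along these lines, for example (a rough sketch; targeting the console's current output code page is my assumption, and error handling is omitted):

#include <windows.h>
#include <iostream>
#include <string>

int main()
{
    const wchar_t* wide = L"café";
    // Encode the wide string into whatever code page the console is actually using.
    UINT codePage = GetConsoleOutputCP();
    int needed = WideCharToMultiByte(codePage, 0, wide, -1, NULL, 0, NULL, NULL);
    std::string narrow(needed, '\0');
    WideCharToMultiByte(codePage, 0, wide, -1, &narrow[0], needed, NULL, NULL);
    std::cout << narrow.c_str() << std::endl;
    return 0;
}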
What characters are supported by C++
The C++ standard does not specify which characters are supported; it is implementation-specific.
Does it have something to do with...
... C++?
No.
... My IDE?
No, although an IDE might have an option to edit a source file in particular encoding.
... my operating system?
This may have an influence.
This is influenced by several things.
What is the encoding of the source file.
What is the encoding that the compiler uses to interpret the source file.
Is it the same as the encoding of the file, or different (it should be the same or it might not work correctly).
The native encoding of your operating system probably influences what character encoding your compiler expects by default.
What encoding does the terminal that runs the program support.
Is it the same as the encoding of the file, or different (it should be the same or it might not work correctly without conversion).
Is the used character encoding wide. By wide, I mean whether the width of a code unit is more than CHAR_BIT. A wide source / compiler will cause a conversion into another, narrow encoding since you use a narrow string literal and narrow stream operator. In this case, you'll need to figure out both the native narrow and the native wide character encoding expected by the compiler. The compiler will convert the input string into the narrow encoding. If the narrow encoding has no representation for the character in the input encoding, it might not work correctly.
An example:
Source file is encoded in UTF-8. Compiler expects UTF-8. The terminal expects UTF-8. In this case, what you see is what you get.
The trick here is setlocale:
#include <clocale>
#include <string>
#include <iostream>

int main() {
    std::setlocale(LC_ALL, "");
    std::string const s("café");
    std::cout << s << '\n';
}
The output for me with the Windows 10 Command Prompt is correct, even without changing the terminal codepage.
I don't know how to solve that:
Imagine, we have 4 websites:
A: UTF-8
B: ISO-8859-1
C: ASCII
D: UTF-16
My program, written in C++, does the following: it downloads a website and parses it. But it has to understand the content. My problem is not the parsing, which is done with ASCII characters like ">" or "<".
The problem is that the program should find all words out of the website's text. A word is any combination of alphanumerical characters.
Then I send these words to a server. The database and the web-frontend are using UTF-8.
So my questions are:
How can I convert "any" (or the most used) character encoding to UTF-8?
How can I work with UTF-8-strings in C++? I think wchar_t does not work because it is 2 bytes long. Code-Points in UTF-8 are up to 4 bytes long...
Are there functions like isspace(), isalnum(), strlen(), tolower() for such UTF-8-strings?
Please note: I do not do any output(like std::cout) in C++. Just filtering out the words and send them to the server.
I know about UTF8-CPP but it has no is*() functions. And as I read, it does not convert from other character encodings to UTF-8. Only from UTF-* to UTF-8.
Edit: I forgot to say, that the program has to be portable: Windows, Linux, ...
How can I convert "any" (or the most used) character encoding to UTF-8?
ICU (International Components for Unicode) is the solution here. It is generally considered to be the last word in Unicode support. Even Boost.Locale and Boost.Regex use it when it comes to Unicode. See my comment on Dory Zidon's answer as to why I recommend using ICU directly, instead of wrappers (like Boost).
You create a converter for a given encoding...
#include <unicode/ucnv.h>

UConverter * converter;
UErrorCode err = U_ZERO_ERROR;
converter = ucnv_open( "8859-1", &err );
if ( U_SUCCESS( err ) )
{
    // ...
    ucnv_close( converter );
}
...and then use the UnicodeString class as appropriate.
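For instance, something like this (a sketch; the helper name and the round trip through toUTF8String are just one way to do it):

#include <unicode/unistr.h>
#include <string>

// Convert bytes in a named legacy encoding to UTF-8 via UnicodeString.
std::string toUtf8(const char* bytes, const char* encoding /* e.g. "ISO-8859-1" */)
{
    icu::UnicodeString ustr(bytes, encoding);   // opens a converter internally
    std::string utf8;
    ustr.toUTF8String(utf8);                    // appends the UTF-8 form to utf8
    return utf8;
}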
I think wchar_t does not work because it is 2 bytes long.
The size of wchar_t is implementation-defined. AFAICR, Windows is 2 byte (UCS-2 / UTF-16, depending on Windows version), Linux is 4 byte (UTF-32). In any case, since the standard doesn't define Unicode semantics for wchar_t, using it is non-portable guesswork. Don't guess, use ICU.
Are there functions like isspace(), isalnum(), strlen(), tolower() for such UTF-8-strings?
Not in their UTF-8 encoding, but you don't use that internally anyway. UTF-8 is good for external representation, but internally UTF-16 or UTF-32 are the better choice. The abovementioned functions do exist for Unicode code points (i.e., UChar32); ref. uchar.h.
Please note: I do not do any output(like std::cout) in C++. Just filtering out the words and send them to the server.
Check BreakIterator.
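A rough sketch of what that looks like (locale choice and error handling are simplified; the u_isalnum filter is just one way to keep only word-like segments):

#include <unicode/brkiter.h>
#include <unicode/locid.h>
#include <unicode/uchar.h>
#include <unicode/unistr.h>
#include <vector>

std::vector<icu::UnicodeString> extractWords(const icu::UnicodeString& text)
{
    std::vector<icu::UnicodeString> words;
    UErrorCode status = U_ZERO_ERROR;
    icu::BreakIterator* it =
        icu::BreakIterator::createWordInstance(icu::Locale::getDefault(), status);
    if (U_FAILURE(status)) return words;
    it->setText(text);
    int32_t start = it->first();
    for (int32_t end = it->next(); end != icu::BreakIterator::DONE; start = end, end = it->next())
    {
        icu::UnicodeString candidate;
        text.extractBetween(start, end, candidate);
        // Keep only segments whose first code point is alphanumeric.
        if (u_isalnum(candidate.char32At(0)))
            words.push_back(candidate);
    }
    delete it;
    return words;
}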
Edit: I forgot to say, that the program has to be portable: Windows, Linux, ...
In case I haven't said it already, do use ICU, and save yourself tons of trouble. Even if it might seem a bit heavyweight at first glance, it is the best implementation out there, it is extremely portable (using it on Windows, Linux, and AIX myself), and you will use it again and again and again in projects to come, so time invested in learning its API is not wasted.
Not sure if this will give you everything you're looking for, but it might help a little.
Have you tried looking at:
1) Boost.Locale library ?
Boost.Locale was released in Boost 1.48 (November 15th, 2011), making it easier to convert to and from UTF-8/16.
Here are some convenient examples from the docs (these functions live in the boost::locale::conv namespace):
string utf8_string = to_utf<char>(latin1_string,"Latin1");
wstring wide_string = to_utf<wchar_t>(latin1_string,"Latin1");
string latin1_string = from_utf(wide_string,"Latin1");
string utf8_string2 = utf_to_utf<char>(wide_string);
2) Or look at the conversions that are part of C++11:
#include <codecvt>
#include <locale>
#include <string>
#include <cassert>

int main() {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
    std::string utf8 = convert.to_bytes(0x5e9);
    assert(utf8.length() == 2);
    assert(utf8[0] == '\xD7');
    assert(utf8[1] == '\xA9');
}
How can I work with UTF-8-strings in C++? I think wchar_t does not work because it is 2 bytes long. Code-Points in UTF-8 are up to 4 bytes long...
This is easy: there is a project named tinyutf8, which is a drop-in replacement for std::string/std::wstring.
Then the user can elegantly operate on codepoints, while their representation is always encoded in chars.
How can I convert "any" (or the most used) character encoding to
UTF-8?
You might want to have a look at std::codecvt_utf8 and similar templates from <codecvt> (C++11).
UTF-8 is an encoding that uses multiple bytes for non-ASCII characters (ASCII being a 7-bit code), utilising the 8th bit. As such, you won't find bytes like '\' or '/' inside a multi-byte sequence, and isdigit works (though not for Arabic and other non-ASCII digits).
It is a superset of ASCII and can hold all Unicode characters, so it is definitely usable with char and string.
Inspect the HTTP headers (case insensitive); they are in ISO-8859-1, and precede an empty line and then the HTML content.
Content-Type: text/html; charset=UTF-8
If it is not present, there might also be
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta charset="UTF-8"> <!-- HTML5 -->
ISO-8859-1 is Latin-1, and you might do better to convert from Windows-1252, the Windows Latin-1 extension that uses 0x80 - 0x9F for some special characters like comma quotes and such.
Even browsers on MacOS will understand these, even though ISO-8859-1 was specified.
Conversion libraries: already mentioned by @syam.
Conversion
Let's not consider UTF-16. One can read the headers, and the start of the document up to a charset meta statement, as single-byte chars.
The conversion from a single-byte encoding to UTF-8 can happen via a table, for instance one generated with Java: a const char* table[] indexed by the char value.
table[157] = "\xEF\xBF\xBD";
public static void main(String[] args) {
final String SOURCE_ENCODING = "windows-1252";
byte[] sourceBytes = new byte[1];
System.out.println(" const char* table[] = {");
for (int c = 0; c < 256; ++c) {
String comment = "";
System.out.printf(" /* %3d */ \"", c);
if (32 <= c && c < 127) {
// Pure ASCII
if (c == '\"' || c == '\\')
System.out.print("\\");
System.out.print((char)c);
} else {
if (c == 0) {
comment = " // Unusable";
}
sourceBytes[0] = (byte)c;
try {
byte[] targetBytes = new String(sourceBytes, SOURCE_ENCODING).getBytes("UTF-8");
for (int j = 0; j < targetBytes.length; ++j) {
int b = targetBytes[j] & 0xFF;
System.out.printf("\\x%02X", b);
}
} catch (UnsupportedEncodingException ex) {
comment = " // " + ex.getMessage().replaceAll("\\s+", " "); // No newlines.
}
}
System.out.print("\"");
if (c < 255) {
System.out.print(",");
}
System.out.println();
}
System.out.println(" };");
}
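On the C++ side, using the generated table could then look something like this (a sketch; table refers to the 256-entry array emitted by the program above):

#include <string>

// Convert a Windows-1252 (or other single-byte) string to UTF-8 using the table.
std::string toUtf8(const std::string& input, const char* const table[256])
{
    std::string output;
    output.reserve(input.size() * 3);   // worst case: 3 UTF-8 bytes per input byte
    for (unsigned char c : input)
        output += table[c];             // append the UTF-8 sequence for this byte value
    return output;
}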
This is the scenario:
I can only use the char* data type for the string, not wchar_t *
My MS Visual C++ compiler has to be set to MBCS, not UNICODE, because the third-party source code that I have is using MBCS; setting it to UNICODE will cause data type issues.
I am trying to print Chinese characters on a printer, which needs to get a character string so it can print correctly.
What should I do with this line to make the code correct: char * str = "你好";
Convert it to hex sequence perhaps? If yes, how? Thanks a lot.
char * str = "你好";
size_t len = strlen(str) + 1;
wchar_t * wstr = new wchar_t[len];
size_t convertedSize = 0;
mbstowcs_s(&convertedSize, wstr, len, str, _TRUNCATE);
cout << convertedSize;
if(! ExtTextOutW(resource->dc, 1,1 , ETO_OPAQUE, NULL, wstr , convertedSize, NULL))
{
return 0;
}
UPDATE: Let me put the question another way.
I have this: char * str contains the sequence of UTF-8 code units for the two Chinese characters 你好, but ExtTextOutW still does not render wstr correctly, because I think my mbstowcs_s call is still not working correctly. Any idea why?
char * str = "\xE4\xBD\xA0\xE5\xA5\xBD";
size_t len = strlen(str) + 1;
wchar_t * wstr = new wchar_t[len];
size_t convertedSize = 0;
mbstowcs_s(&convertedSize, wstr, len, str, _TRUNCATE);
if(! ExtTextOutW(resource->dc, 1,1 , ETO_OPAQUE, NULL, wstr , len, NULL))
{
return 0;
}
The fact is, 你好 is a sequence of Unicode characters. You will need to use a Unicode character set in order to ensure that it will be displayed correctly.
The only possible exception to that is if you're using a multi-byte character set that includes both of these characters in the basic character set. Since you say that you're stuck compiling for the MBCS anyway, that might be a solution. In order to make it work, you will have to set the system language to one that includes this character. The exact way you do this changes in each OS version. I think they're trying to "improve" it. On Windows 7, at least, they call this the "Language for non-Unicode programs" setting, accessible in the "Regions and Language" control panel.
If there is no system language in which these characters are provided as part of the basic character set, then you are basically out of luck.
Even if you tried to use a UTF-8 encoding (which Windows does not natively support, instead preferring UTF-16 for its Unicode support), which uses the char data type, it is very likely that whatever other application/library you're interfacing with would not be able to deal with it. Windows applications assume that a char holds a character in the current ANSI/MB character set. Unicode characters are in a wchar_t, and since you can't use that, it indicates the application simply doesn't support Unicode. (That means it's broken, by the way—time to upgrade.)
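That said, if the bytes really are UTF-8 (as in the update above), the usual way to get them into a wchar_t buffer on Windows is MultiByteToWideChar with CP_UTF8 rather than the locale-based mbstowcs_s. A sketch (error handling and cleanup omitted):

#include <windows.h>

// mbstowcs_s converts using the current ANSI code page, which is why it
// mangles UTF-8 input; CP_UTF8 tells Windows how the bytes are really encoded.
const char* str = "\xE4\xBD\xA0\xE5\xA5\xBD";   // 你好 as UTF-8
int wlen = MultiByteToWideChar(CP_UTF8, 0, str, -1, NULL, 0);   // includes the terminating null
wchar_t* wstr = new wchar_t[wlen];
MultiByteToWideChar(CP_UTF8, 0, str, -1, wstr, wlen);
// wstr now holds L"你好" and can be passed to ExtTextOutW.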
As an adaptation from what MYMNeo said, I would suggest that this would work:
wchar_t *str = L"你好";
fputws(str, stdout);
ps. This probably isn't C: cout << convertedSize;.
The input is given in a language with a script other than the Roman alphabet. A program in C or C++ must recognize it.
How do I take input in Tamil and split it into letters so that I can recognize each Tamil letter?
How do I use wchar_t and locale?
The C++ standard library does not handle Unicode completely, and neither does C's; you'd be better off using a library like Boost, which is cross-platform.
Including and using the WinAPI and windows.h allows you to use Unicode, but only in Win32 programs.
See here for a previous rant of mine on this subject.
Assuming that your platform is capable of handling Tamil characters, I suggest the following sequence of events:
I. Get the input string into a wide string:
#include <clocale>
#include <cstdlib>

int main()
{
    setlocale(LC_CTYPE, "");
    const char * s = getInputString(); // e.g. from the command line
    const size_t wl = mbstowcs(NULL, s, 0);
    wchar_t * ws = new wchar_t[wl + 1];   // + 1 for the terminating null
    mbstowcs(ws, s, wl + 1);
    //...
II. Convert the wide string into a string with definite encoding:
#include <iconv.h>
// ...
iconv_t cd = iconv_open("UTF-32", "WCHAR_T");
size_t iin = wl * sizeof(wchar_t);          // iconv counts bytes, not characters
size_t iout = 2 * wl * sizeof(uint32_t);    // random safety margin
uint32_t * us = new uint32_t[2 * wl];
char * inbuf = reinterpret_cast<char*>(ws);
char * outbuf = reinterpret_cast<char*>(us);
iconv(cd, &inbuf, &iin, &outbuf, &iout);    // iconv takes char** so it can advance the pointers
iconv_close(cd);
// ...
Finally, you have in us an array of Unicode codepoints that made up your input text. You can now process this array, e.g. by looking each codepoint up in a list and checking whether it comes from the Tamil script, and do with it whatever you see fit.
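For example, a minimal check against the Tamil Unicode block (U+0B80 to U+0BFF) could look like this:

#include <cstdint>

// True if the code point falls in the Tamil block.
bool isTamil(uint32_t codePoint)
{
    return codePoint >= 0x0B80 && codePoint <= 0x0BFF;
}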
Environment: Gcc/G++ Linux
I have a non-ascii file in file system and I'm going to open it.
Now I have a wchar_t*, but I don't know how to open it. (my trusted fopen only opens char* file)
Please help. Thanks a lot.
There are two possible answers:
If you want to make sure all Unicode filenames are representable, you can hard-code the assumption that the filesystem uses UTF-8 filenames. This is the "modern" Linux desktop-app approach. Just convert your strings from wchar_t (UTF-32) to UTF-8 with library functions (iconv would work well) or your own implementation (but look up the specs so you don't get it horribly wrong like Shelwien did), then use fopen.
If you want to do things the more standards-oriented way, you should use wcsrtombs to convert the wchar_t string to a multibyte char string in the locale's encoding (which hopefully is UTF-8 anyway on any modern system) and use fopen. Note that this requires that you previously set the locale with setlocale(LC_CTYPE, "") or setlocale(LC_ALL, "").
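A sketch of that route (it assumes setlocale(LC_CTYPE, "") has already been called, and most error handling is omitted):

#include <cstdio>
#include <cwchar>
#include <vector>

FILE* openWideName(const wchar_t* wname, const char* mode)
{
    std::mbstate_t state = std::mbstate_t();
    const wchar_t* src = wname;
    size_t needed = std::wcsrtombs(NULL, &src, 0, &state);   // bytes needed, excluding the terminator
    if (needed == (size_t)-1)
        return NULL;                                         // name not representable in this locale
    std::vector<char> name(needed + 1);
    src = wname;
    state = std::mbstate_t();
    std::wcsrtombs(name.data(), &src, name.size(), &state);
    return std::fopen(name.data(), mode);
}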
And finally, not exactly an answer but a recommendation:
Storing filenames as wchar_t strings is probably a horrible mistake. You should instead store filenames as abstract byte strings, and only convert those to wchar_t just-in-time for displaying them in the user interface (if it's even necessary for that; many UI toolkits use plain byte strings themselves and do the interpretation as characters for you). This way you eliminate a lot of possible nasty corner cases, and you never encounter a situation where some files are inaccessible due to their names.
Linux is not UTF-8, but it's your only choice for filenames anyway
(Files can have anything you want inside them.)
With respect to filenames, linux does not really have a string encoding to worry about. Filenames are byte strings that need to be null-terminated.
This doesn't precisely mean that Linux is UTF-8, but it does mean that it's not compatible with wide characters as they could have a zero in a byte that's not the end byte.
But UTF-8 preserves the no-nulls-except-at-the-end model, so I have to believe that the practical approach is "convert to UTF-8" for filenames.
The content of files is a matter for standards above the Linux kernel level, so here there isn't anything Linux-y that you can or want to do. The content of files will be solely the concern of the programs that read and write them. Linux just stores and returns the byte stream, and it can have all the embedded nuls you want.
Convert the wchar_t string to a UTF-8 char string, then use fopen.
typedef unsigned int uint;
typedef unsigned short word;
typedef unsigned char byte;
int UTF16to8( wchar_t* w, char* s ) {
uint c;
word* p = (word*)w;
byte* q = (byte*)s; byte* q0 = q;
while( 1 ) {
c = *p++;
if( c==0 ) break;
if( c<0x080 ) *q++ = c; else
if( c<0x800 ) *q++ = 0xC0+(c>>6), *q++ = 0x80+(c&63); else
*q++ = 0xE0+(c>>12), *q++ = 0x80+((c>>6)&63), *q++ = 0x80+(c&63);
}
*q = 0;
return q-q0;
}
int UTF8to16( char* s, wchar_t* w ) {
uint cache,wait,c;
byte* p = (byte*)s;
word* q = (word*)w; word* q0 = q;
while(1) {
c = *p++;
if( c==0 ) break;
if( c<0x80 ) cache=c,wait=0; else
if( (c>=0xC0) && (c<=0xE0) ) cache=c&31,wait=1; else
if( (c>=0xE0) ) cache=c&15,wait=2; else
if( wait ) (cache<<=6)+=c&63,wait--;
if( wait==0 ) *q++=cache;
}
*q = 0;
return q-q0;
}
Check out this document
http://www.firstobject.com/wchar_t-string-on-linux-osx-windows.htm
I think Linux follows POSIX standard, which treats all file names as UTF-8.
I take it it's the name of the file that contains non-ascii characters, not the file itself, when you say "non-ascii file in file system". It doesn't really matter what the file contains.
You can do this with normal fopen, but you'll have to match the encoding the filesystem uses.
It depends on what version of Linux and what filesystem you're using and how you've set it up, but likely, if you're lucky, the filesystem uses UTF-8. So take your wchar_t (which is probably a UTF-16 encoded string?), convert it to a char string encoded in UTF-8, and pass that to fopen.