For example, I need the code point of the 5th character here, which is ð:
const WCHAR* mystring = L"Þátíð";
I know that its code point is U+00F0, but how do I get this integer using C++?
WCHAR in Windows 2000 and later is UTF-16LE so it is not necessarily safe to access a specific character in a string by index. You should use something like CharNext to walk the string to get correct handling of surrogate pairs and combining characters/diacritics.
In this specific example Forgottn's answer depends on the compiler emitting precomposed versions of the á and í characters... (This is probably true for most Windows compilers, porting to Mac OS is probably problematic)
const WCHAR myString[] = L"Þátíð";
size_t myStringLength = 0;
if (SUCCEEDED(StringCchLengthW(myString, STRSAFE_MAX_CCH, &myStringLength)))
{
    LPCWSTR myStringIterator = myString;
    // Advance by CharNextW steps rather than by index, and stop at the terminator:
    // one step may cover more than one WCHAR.
    while (*myStringIterator != L'\0')
    {
        unsigned int mySuperSecretUnicodeCharacter = *myStringIterator;
        LPCWSTR myNextIterator = CharNextW(myStringIterator);
        // Any further code units that CharNextW grouped with the first one.
        std::vector<unsigned int> diacriticsOfMySuperSecretUnicodeCharacter(myStringIterator + 1, myNextIterator);
        myStringIterator = myNextIterator;
    }
}
Edit 1: made it actually work
Edit 2: made it actually look for all codepoints
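As a portable illustration of the surrogate-pair handling mentioned above, here is a minimal sketch that counts code points in a UTF-16 string and returns the n-th one. The helper name code_point_at is hypothetical, and std::u16string is used as an assumption so that the code does not depend on the Windows headers. It counts code points, not user-perceived characters, so a combining mark still counts as its own entry.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the answers above): returns the code point of
// the n-th code point (0-based) in a UTF-16 string, or -1 if n is out of range.
long code_point_at(const std::u16string& s, std::size_t n)
{
    std::size_t index = 0;
    for (std::size_t i = 0; i < s.size(); ++i, ++index) {
        char32_t cp = s[i];
        // A lead surrogate followed by a trail surrogate encodes one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < s.size()
            && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (s[i + 1] - 0xDC00);
            ++i; // the trail surrogate has been consumed
        }
        if (index == n)
            return static_cast<long>(cp);
    }
    return -1;
}
```

With the precomposed spelling of the string from the question, code_point_at(u"\u00DE\u00E1t\u00ED\u00F0", 4) gives 0xF0. If the compiler or editor stored á and í in decomposed form, the index of ð would shift, which is the caveat raised above.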
Related
So I wanted to try converting Unicode to an integer for a project of mine. I tried something like this :
unsigned int foo = (unsigned int)L'آ';
std::cout << foo << std::endl;
How do I convert it back? Or in other words, How do I convert an int to the respective Unicode character ?
EDIT: I am expecting the output to be the Unicode character for an integer, for example:
cout << (wchar_t) 1570 ; // This should print the unicode value of 1570 (which is :آ)
I am using Visual Studio 2013 Community with its default compiler, on Windows 10 Pro 64-bit.
Cheers
L'آ' will work fine as a single wide character, because its code point is below 0x10000. But in general UTF-16 includes surrogate pairs, so a Unicode code point cannot always be represented with a single wide character. You need a wide string instead.
Your problem is also partly to do with printing UTF16 character in Windows console. If you use MessageBoxW to view a wide string it will work as expected:
wchar_t buf[2] = { 0 };
buf[0] = 1570;
MessageBoxW(0, buf, 0, 0);
However, in general you need a wide string to account for surrogate pairs, not a single wide char. Example:
int utf32 = 1570;
const int mask = (1 << 10) - 1;
std::wstring str;
if(utf32 < 0x10000)
{
str.push_back((wchar_t)utf32);
}
else
{
utf32 -= 0x10000;
int hi = (utf32 >> 10) & mask;
int lo = utf32 & mask;
hi += 0xD800;
lo += 0xDC00;
str.push_back((wchar_t)hi);
str.push_back((wchar_t)lo);
}
MessageBoxW(0, str.c_str(), 0, 0);
See related posts for printing UTF16 in Windows console.
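The same conversion can be wrapped in a function that also rejects values that are not valid code points. This is only a sketch: the name encode_utf16 is hypothetical, and std::u16string is used as an assumption to keep it independent of the width of wchar_t (on Windows the code units could be copied into a std::wstring unchanged).

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-16.
// Returns an empty string for values that are not valid scalar values.
std::u16string encode_utf16(char32_t cp)
{
    std::u16string out;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return out; // out of range, or a lone surrogate value
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));   // lead surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF))); // trail surrogate
    }
    return out;
}
```

For 1570 this yields the single code unit 0x0622; for a code point such as 0x1F600 it yields the pair 0xD83D, 0xDE00.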
The key here is the call setlocale(LC_ALL, "en_US.UTF-8"). The en_US part is the locale name, which you may want to change, for example to zh_CN for Chinese.
#include <locale.h>
#include <wchar.h>
int main() {
    // Without this call the wide characters below are not converted correctly.
    setlocale(LC_ALL, "en_US.UTF-8");
    for (wint_t ch = 30000; ch < 30030; ch++) {
        wprintf(L"%lc", ch);
    }
    // Keep using wide output: printf and wprintf should not be mixed on one stream.
    wprintf(L"\n");
    return 0;
}
Note the use of wprintf and the form of the format string, L"%lc": the L prefix makes the format string a wide string, and %lc tells wprintf to treat the argument as a wide character.
If you want to use this method to print some variables, use the type wchar_t.
Useful links:
setlocale
wprintf
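If depending on the locale is not an option, one alternative is to encode the code point as UTF-8 by hand and write the bytes with ordinary narrow output. The sketch below is not part of the answer above: the name to_utf8 is hypothetical, it assumes the terminal or file expects UTF-8, and it does not validate its input (surrogate values and values above 0x10FFFF are encoded as given).

```cpp
#include <string>

// Hypothetical helper: encodes one code point as a UTF-8 byte sequence.
// No validation is performed; the caller is assumed to pass a valid code point.
std::string to_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, std::cout << to_utf8(30000) writes the three bytes E7 94 B0, which a UTF-8 terminal shows as the first character printed by the loop above.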
Converting CString to an int in ASCII mode is as simple as
CString s("123");
int n = atoi(s);
However that doesn't work for projects in UNICODE mode as CString becomes a wide-char string.
How do I write my code to cover both ASCII and UNICODE modes without extra if statements?
Turns out there's a _ttoi() available just for that purpose:
CString s( _T("123") );
int n = _ttoi(s);
This works for both modes with no extra effort.
If you need to convert hexadecimal (or other-base) numbers you can resort to a more generic strtol() variant:
CString s( _T("0xFA3") );
int n = _tcstol(s, nullptr, 16);
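One advantage of the strtol family over atoi-style functions is that it reports where parsing stopped, so bad input can be told apart from a genuine 0. In a UNICODE build _tcstol maps to wcstol, so the idea can be sketched with the standard function directly. The helper name parse_long is hypothetical, and a plain const wchar_t* is used instead of CString as an assumption to keep the example free of MFC.

```cpp
#include <cerrno>
#include <cwchar>

// Hypothetical helper: parses a whole wide string as an integer in the given base.
// Returns false if nothing was parsed, if trailing characters remain,
// or if the value does not fit in a long.
bool parse_long(const wchar_t* text, int base, long& value)
{
    wchar_t* end = nullptr;
    errno = 0;
    const long parsed = std::wcstol(text, &end, base);
    if (end == text || *end != L'\0' || errno == ERANGE)
        return false;
    value = parsed;
    return true;
}
```

With this, parse_long(L"0xFA3", 16, n) succeeds and sets n to 4003, while parse_long(L"12ab", 10, n) fails instead of silently returning 12.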
There's a special version of CString that uses multibyte characters even if your build is specified for wide characters - CStringA. It will also convert from wide characters automatically.
CString s(_T("123"));
CStringA sa = s;
int n = atoi(sa);
There's a corresponding CStringW that only uses wide characters.
My code is the following (reduced):
CComVariant* input is an input parameter
CString cstrPath(input ->bstrVal);
const CHAR cInvalidChars[] = {"/*&#^°\"§$[]?´`\';|\0"};
for (unsigned int i = 0; i < strlen(cInvalidChars); i++)
{
cstrPath.Replace(cInvalidChars[i],_T(''));
}
When debugging, value of cstrPath is L"§", value of cInvalidChars[7] is -89 '§'
I have tried to use .Remove() before, but the problem remains the same: when it comes to § or ´, the code table does not seem to match, so the char is not recognized properly and is not removed. Using a TCHAR array for invalidChars results in different problems again ('§' -> 'ᄡ').
The problem seems to be that I am not using the correct code tables, but nothing I have tried so far has worked.
I want to successfully replace or delete every occurrence of '§'.
I have also looked at several "delete character from string" posts, but I did not find anything that helped me.
executable code:
CComVariant* pccovaValue = new CComVariant();
pccovaValue->bstrVal = L"§§";
const CHAR cInvalidChars[] = {"§"};
CString cstrPath(pccovaValue->bstrVal);
for (unsigned int i = 0; i < strlen(cInvalidChars); i++)
{
cstrPath.Remove(cInvalidChars[i]);
}
cstrPath = cstrPath;
just break into cstrPath = cstrPath;
According to the comments you are mixing up Unicode and ANSI encodings. It seems that your application is targeting Unicode which is good. You should stop using ANSI altogether.
Declare cInvalidChars like this:
CString cInvalidChars = L"/*&#^°\"§$[]?´`\';|";
The use of the L prefix means that the string literal is a wide character UTF-16 literal.
Then your loop can look like this:
for (int i = 0; i < cInvalidChars.GetLength(); i++)
cstrPath.Remove(cInvalidChars[i]);
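The same idea, keeping both the text and the list of invalid characters wide, can be sketched without MFC using std::wstring and the erase-remove idiom. The function name remove_chars is hypothetical, and the use of std::wstring in place of CString is an assumption made to keep the example self-contained.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: returns a copy of text with every character that
// appears in invalid removed. Both strings are wide, so no code page is involved.
std::wstring remove_chars(std::wstring text, const std::wstring& invalid)
{
    text.erase(std::remove_if(text.begin(), text.end(),
                              [&invalid](wchar_t c) {
                                  return invalid.find(c) != std::wstring::npos;
                              }),
               text.end());
    return text;
}
```

For instance, remove_chars(L"a\u00A7b\u00A7", L"\u00A7$") returns L"ab". Writing § as \u00A7 also sidesteps any doubt about how the source file itself is encoded.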
This is the scenario:
I can only use the char* data type for the string, not wchar_t *
My MS Visual C++ compiler has to be set to MBCS, not UNICODE because the third party source code that I have is using MBCS; Setting it to UNICODE will cause data type issues.
I am trying to print Chinese characters on a printer, which needs to receive a character string so that it can print correctly
What should I do with this line to make the code correct: char * str = "你好";
Convert it to hex sequence perhaps? If yes, how? Thanks a lot.
char * str = "你好";
size_t len = strlen(str) + 1;
wchar_t * wstr = new wchar_t[len];
size_t convertedSize = 0;
mbstowcs_s(&convertedSize, wstr, len, str, _TRUNCATE);
cout << convertedSize;
if(! ExtTextOutW(resource->dc, 1,1 , ETO_OPAQUE, NULL, wstr , convertedSize, NULL))
{
return 0;
}
UPDATE : Let's put the question in another way
I have this: char * str contains a sequence of UTF-8 code units for the two Chinese characters 你好. ExtTextOutW still cannot render wstr correctly, I think because my mbstowcs_s call is still not working correctly. Any idea why?
char * str = "\xE4\xBD\xA0\xE5\xA5\xBD";
size_t len = strlen(str) + 1;
wchar_t * wstr = new wchar_t[len];
size_t convertedSize = 0;
mbstowcs_s(&convertedSize, wstr, len, str, _TRUNCATE);
if(! ExtTextOutW(resource->dc, 1,1 , ETO_OPAQUE, NULL, wstr , len, NULL))
{
return 0;
}
The fact is, 你好 is a sequence of Unicode characters. You will need to use a Unicode character set in order to ensure that it will be displayed correctly.
The only possible exception to that is if you're using a multi-byte character set that includes both of these characters in the basic character set. Since you say that you're stuck compiling for the MBCS anyway, that might be a solution. In order to make it work, you will have to set the system language to one that includes this character. The exact way you do this changes in each OS version. I think they're trying to "improve" it. On Windows 7, at least, they call this the "Language for non-Unicode programs" setting, accessible in the "Regions and Language" control panel.
If there is no system language in which these characters are provided as part of the basic character set, then you are basically out of luck.
Even if you tried to use a UTF-8 encoding (which Windows does not natively support, instead preferring UTF-16 for its Unicode support), which uses the char data type, it is very likely that whatever other application/library you're interfacing with would not be able to deal with it. Windows applications assume that a char holds a character in the current ANSI/MB character set. Unicode characters are in a wchar_t, and since you can't use that, it indicates the application simply doesn't support Unicode. (That means it's broken, by the way—time to upgrade.)
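For the byte sequence in the question's update, note that mbstowcs_s interprets its input according to the current C locale, not as UTF-8, which would explain why the wide string comes out wrong. On Windows, MultiByteToWideChar with CP_UTF8 is the usual way to do this conversion. To show what such a conversion involves, here is a hand-written sketch; the name utf8_to_utf16 is hypothetical, std::u16string is an assumption made for portability, and it is not a fully validating decoder (overlong forms and encoded surrogates are not rejected).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into UTF-16 code units.
// Stray or truncated sequences are replaced with U+FFFD.
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra; // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else                         { cp = 0xFFFD;   extra = 0; } // stray byte
        ++i;
        for (; extra > 0; --extra, ++i) {
            if (i >= in.size() || (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                cp = 0xFFFD; // truncated or malformed sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        }
        if (cp > 0x10FFFF)
            cp = 0xFFFD;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Applied to "\xE4\xBD\xA0\xE5\xA5\xBD", this produces the two code units 0x4F60 and 0x597D, that is, 你好. The character count to pass to a call such as ExtTextOutW would then be the length of the converted string, not the length of the UTF-8 input.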
As an adaptation from what MYMNeo said, I would suggest that this would work:
wchar_t *str = L"你好";
fputws(str, stdout);
ps. This probably isn't C: cout << convertedSize;.
I was wondering is it safe to do so?
wchar_t wide = /* something */;
assert(wide >= 0 && wide < 256 &&);
char myChar = static_cast<char>(wide);
If I am pretty sure the wide char will fall within ASCII range.
Why not just use a library routine such as wcstombs?
assert is for ensuring that something is true in a debug mode, without it having any effect in a release build. Better to use an if statement and have an alternate plan for characters that are outside the range, unless the only way to get characters outside the range is through a program bug.
Also, depending on your character encoding, you might find a difference between the Unicode characters 0x80 through 0xff and their char version.
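The "if statement with an alternate plan" suggested above could look roughly like this. The helper name try_narrow is hypothetical, and the sketch assumes an ASCII-compatible execution character set, so that values 0 to 127 mean the same thing in char and wchar_t.

```cpp
// Hypothetical helper: narrows a wide character only if it is in the 7-bit
// ASCII range. Returns false and leaves out untouched otherwise, so the
// caller can decide what to do instead of relying on an assert.
bool try_narrow(wchar_t wide, char& out)
{
    const long value = static_cast<long>(wide);
    if (value < 0 || value > 127)
        return false;
    out = static_cast<char>(value);
    return true;
}
```

A caller might substitute '?' or report an error when try_narrow returns false. Unlike the assert, this check stays active in release builds.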
You are looking for wctomb(): it's in the ANSI standard, so you can count on it. It works even when the wchar_t uses a code above 255. You almost certainly do not want to use it.
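For reference, a call to wctomb might be wrapped as in the sketch below. The wrapper name narrow_with_wctomb is hypothetical. The result depends on the current C locale, and one wide character may turn into several bytes, which is part of why the result does not fit in a single char.

```cpp
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical wrapper: converts one wide character to its multibyte form in
// the current C locale. Returns an empty string if it cannot be represented.
std::string narrow_with_wctomb(wchar_t wc)
{
    char buf[MB_LEN_MAX]; // large enough for any single multibyte character
    const int n = std::wctomb(buf, wc);
    if (n < 0)
        return std::string();
    return std::string(buf, static_cast<std::size_t>(n));
}
```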
wchar_t is an integral type, so your compiler won't complain if you actually do:
char x = (char)wc;
but because it's an integral type, there's absolutely no reason to do this. If you accidentally read Herbert Schildt's C: The Complete Reference, or any C book based on it, then you're completely and grossly misinformed. Characters should be of type int or better. That means you should be writing this:
int x = getchar();
and not this:
char x = getchar(); /* <- WRONG! */
As far as integral types go, char is worthless. You shouldn't make functions that take parameters of type char, and you should not create temporary variables of type char, and the same advice goes for wchar_t as well.
char* may be a convenient typedef for a character string, but it is a novice mistake to think of this as an "array of characters" or a "pointer to an array of characters" - despite what the cdecl tool says. Treating it as an actual array of characters with nonsense like this:
for(int i = 0; s[i]; ++i) {
wchar_t wc = s[i];
char c = doit(wc);
out[i] = c;
}
is absurdly wrong. It will not do what you want; it will break in subtle and serious ways, behave differently on different platforms, and you will most certainly confuse the hell out of your users. If you see this, you are trying to reimplement wcstombs() which is part of ANSI C already, but it's still wrong.
You're really looking for iconv(), which converts a character string from one encoding (even if it's packed into a wchar_t array), into a character string of another encoding.
Now go read this, to learn what's wrong with iconv.
An easy way is :
wstring your_wchar_in_ws(<your wchar>);
string your_wchar_in_str(your_wchar_in_ws.begin(), your_wchar_in_ws.end());
const char* your_wchar_in_char = your_wchar_in_str.c_str();
I have been using this method for years :)
A short function I wrote a while back to pack a wchar_t array into a char array. Characters outside the ASCII range (0-127) are replaced by '?' characters, and a surrogate pair is replaced by a single '?'.
size_t to_narrow(const wchar_t * src, char * dest, size_t dest_len){
    if (dest_len == 0)
        return 0;
    size_t i = 0; // index into src
    size_t j = 0; // index into dest
    while (src[i] != L'\0' && j < (dest_len - 1)){
        wchar_t code = src[i];
        if (code < 128)
            dest[j++] = char(code);
        else{
            dest[j++] = '?';
            // lead surrogate followed by a trail: skip the trail so the pair yields one '?'
            if (code >= 0xD800 && code <= 0xDBFF && src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF)
                i++;
        }
        i++;
    }
    dest[j] = '\0';
    return j; // number of chars written, not counting the terminator
}
Technically, 'char' could have the same range as either 'signed char' or 'unsigned char'. For the unsigned characters, your range is correct; theoretically, for signed characters, your condition is wrong. In practice, very few compilers will object - and the result will be the same.
Nitpick: the last && in the assert is a syntax error.
Whether the assertion is appropriate depends on whether you can afford to crash when the code gets to the customer, and what you could or should do if the assertion condition is violated but the assertion is not compiled into the code. For debug work, it seems fine, but you might want an active test after it for run-time checking too.
Here's another way of doing it, remember to use free() on the result.
char* wchar_to_char(const wchar_t* pwchar)
{
    // count the characters in the string, including the terminator
    int charCount = 0;
    while (pwchar[charCount] != L'\0')
    {
        charCount++;
    }
    charCount++;
    // allocate a new block of memory sized in char (1 byte) instead of wide char
    char* filePathC = (char*)malloc(sizeof(char) * charCount);
    if (filePathC == NULL)
    {
        return NULL;
    }
    for (int i = 0; i < charCount; i++)
    {
        // convert to char (1 byte); this truncates anything that does not fit in a char
        filePathC[i] = (char)pwchar[i];
    }
    return filePathC;
}
One could also convert wchar_t --> wstring --> string --> char:
wchar_t wide = L'a'; // the wide character to convert
wstring wstrValue(1, wide); // a wide string holding that one character
string strValue;
strValue.assign(wstrValue.begin(), wstrValue.end()); // convert wstring to string by narrowing each element
char char_value = strValue[0];
In general, no. int(wchar_t(255)) == int(char(255)) may well hold (it does where char is unsigned), but that just means they have the same int value. They may not represent the same characters.
You would see such a discrepancy in the majority of Windows PCs, even. For instance, on Windows Code page 1250, char(0xFF) is the same character as wchar_t(0x02D9) (dot above), not wchar_t(0x00FF) (small y with diaeresis).
Note that it does not even hold for the ASCII range, as C++ doesn't even require ASCII. On IBM systems in particular you may see that 'A' != 65.