CoreFoundation printing Unicode characters - c++

I have the following code and it does seem to work, except that CFShow doesn't translate the UTF-8 encoded \u00e9 into é:
#include <CoreFoundation/CoreFoundation.h>

int main()
{
    const char *s = "This is a test of unicode support: fiancée\n";
    CFStringRef cfs = CFStringCreateWithCString(NULL, s, kCFStringEncodingUTF8);
    CFShow(cfs);
    CFRelease(cfs);
    return 0;
}
Output is
This is a test of unicode support: fianc\u00e9e
The é doesn't output properly.
How do I tell CFShow that the string is Unicode? printf handles it fine when it is a C string.

CFShow() is only for debugging. It's deliberately converting non-ASCII to escape codes in order to avoid ambiguity. For example, "é" can be expressed in two ways: as U+00E9 LATIN SMALL LETTER E WITH ACUTE or as U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT. If CFShow() were to emit the UTF-8 sequence, your terminal would likely present it as "é" and you wouldn't be able to tell which variant was in the string. That would undermine the usefulness of CFShow() for debugging.
Why do you care what the output of CFShow() looks like, so long as you understand what the content of the string is?

It appears to me that CFShow knows that the string is Unicode, but doesn't know how to format Unicode for the console. I doubt that you can do anything but look for an alternative, perhaps NSLog.
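If you just want the raw UTF-8 bytes on stdout rather than CFShow()'s escaped debug output, one option (a minimal sketch with a fixed-size buffer and no real error handling) is to copy the bytes back out of the CFString with CFStringGetCString and hand them to printf, so the terminal decides how to render the é:
#include <CoreFoundation/CoreFoundation.h>
#include <cstdio>

int main()
{
    const char *s = "This is a test of unicode support: fiancée\n";
    CFStringRef cfs = CFStringCreateWithCString(NULL, s, kCFStringEncodingUTF8);

    // Copy the UTF-8 bytes back out of the CFString and print them as-is.
    char buffer[256];
    if (CFStringGetCString(cfs, buffer, sizeof buffer, kCFStringEncodingUTF8))
        printf("%s", buffer);

    CFRelease(cfs);
    return 0;
}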


find if string starts with \U in Python 3.3

I have a string and I want to find out if it starts with \U.
Here is an example
myStr = '\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'
I was trying this:
myStr.startswith('\\U')
but I get False.
How can I detect \U in a string?
The larger picture:
I have a list of strings, most of them are normal English word strings, but there are a few that are similar to what I have shown in myStr, how can I distinguish them?
The original string does not have the character \U. It has the unicode escape sequence \U0001f64c, which is a single Unicode character.
Therefore, it does not make sense to try to detect \U in the string you have given.
Trying to detect the \U in that string is similar to trying to detect \x in the C string "\x90".
It makes no sense because the interpreter has read the sequence and converted it. Of course, if you want to detect the first Unicode character in that string, that works fine.
myStr.startswith('\U0001f64c')
Note that if you define the string with a literal \U, like this, you can detect it just fine. (Based on some experimentation, Python 2.7.6 treats plain string literals this way by default, since \U escapes are only interpreted inside unicode literals there.)
myStr = r'\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'
myStr.startswith('\\U') # Returns True.
Update: The OP requested a way to convert from the Unicode string into the raw string above.
I will show the solution in two steps.
First observe that we can view the raw hex for each character like this.
>>> [hex(ord(x)) for x in myStr]
['0x1f64c', '0x1f60d', '0x1f4a6', '0x1f445', '0x1f4af']
Next, we build a format string with one \U%08x slot per character and fill it with those code points.
myChars = [ord(x) for x in myStr]
formatString = "".join(r'\U%08x' for x in myStr)
output = formatString % tuple(myChars)
output.startswith("\\U") # Returns True.
Note of course that since we are converting a Unicode string and formatting it this way deliberately, it is guaranteed to start with \U. However, I assume your actual application is not just to detect whether it starts with \U.
Update2: If the OP is trying to differentiate between "normal English" strings and "Unicode Strings", the above approach will not work, because all characters have a corresponding Unicode representation.
However, one heuristic you might use to check whether a string looks like ASCII is to just check whether the values of each character are outside the normal ASCII range. Assuming that you consider the normal ASCII range to be between 32 and 127 (You can take a look here and decide what you want to include.), you can do something like the following.
def isNormal(myStr):
    myChars = [ord(x) for x in myStr]
    return all(x < 128 and x > 31 for x in myChars)
This can be done in one line, but I separated it to make it more readable.
Your string:
myStr = '\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'
is not foreign-language text. It is 5 Unicode characters, which are (in order):
PERSON RAISING BOTH HANDS IN CELEBRATION
SMILING FACE WITH HEART-SHAPED EYES
SPLASHING SWEAT SYMBOL
TONGUE
HUNDRED POINTS SYMBOL
If you want to get strings that only contain 'normal' characters, you can use something like this:
import re

if re.search(r'[^A-Za-z0-9\s]', myStr):
    # The string contains 'weird' characters.
    pass
Note that this will also trip on characters like é, which is sometimes used in English in words of French origin.

Qt QString from string - Strange letters

Whenever I try to convert a std::string into a QString with this letter in it ('ß'), the QString will turn into something like "Ã" or some other really strange letters. What's wrong? I used this code and it didn't cause any errors or warnings!
std::string content = "Heißes Teil.";
ui->txtFind_lang->setText(QString::fromStdString(content));
The std::string has no problem with this character. I even wrote it into a text file without problems. So what am I doing wrong?
You need to set the codec to UTF-8:
QTextCodec::setCodecForTr(QTextCodec::codecForName("UTF-8"));
QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8"));
QTextCodec::setCodecForLocale(QTextCodec::codecForName("UTF-8"));
By default, Qt 4 uses the Latin-1 encoding for these conversions, which is limited. By adding this code, you set the default encoding to UTF-8, which allows you to use many more characters. (Note that setCodecForTr() and setCodecForCStrings() exist only in Qt 4; they were removed in Qt 5, where QString::fromStdString() already interprets the bytes as UTF-8.)
Though antoyo's answer works, I wasn't too sure why. So, I decided to investigate.
All of my documents are encoded in UTF-8, as are most web pages. The ß character has the Unicode code point U+00DF.
Since UTF-8 is a variable-length encoding, ß is encoded as the two bytes 11000011 10011111, i.e. C3 9F. Because Qt defaults to the Latin-1 encoding, it reads ß as two separate characters: the first byte C3 maps to Ã, and the second byte 9F maps to nothing printable, since Latin-1 assigns no printable characters to bytes 128-159 (in decimal).
That's why ß appears as Ã when using the Latin-1 encoding.
Side note: You might want to brush up on how UTF-8 encoding works, because otherwise it may seem unintuitive that ß takes two bytes even though its code point U+00DF is below U+00FF and looks like it should fit in a single byte.
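For what it's worth, here is a minimal sketch of the difference, assuming Qt 5 (where the setCodecFor* calls above for Tr and C strings no longer exist and QString::fromStdString() already assumes UTF-8). It decodes the same bytes once as Latin-1 and once as UTF-8:
#include <QDebug>
#include <QString>
#include <string>

int main()
{
    // "Heißes Teil." spelled out as UTF-8 bytes; the literal is split so the
    // hex escape \x9F doesn't swallow the following 'e'.
    std::string content = "Hei\xC3\x9F" "es Teil.";

    QString wrong = QString::fromLatin1(content.c_str());  // C3 -> Ã, 9F -> nothing printable
    QString right = QString::fromUtf8(content.c_str());    // C3 9F decoded as UTF-8 -> ß

    qDebug() << wrong;
    qDebug() << right;
    return 0;
}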

Multi-Byte to Widechar conversion using mbsnrtowcs

I'm trying to convert a multi-byte (UTF-8) string to a wide-character string, and mbsnrtowcs always fails. Here are the input and the expected result:
char* pInputMultiByteString = "A quick brown Fox jumps \xC2\xA9 over the lazy Dog.";
wchar_t* pExpectedWideString = L"A quick brown Fox jumps \x00A9 over the lazy Dog.";
Special character is the copyright symbol.
This conversion works fine when I use the Windows MultiByteToWideChar routine, but since that API is not available on Linux, I have to use mbsnrtowcs, which keeps failing. I've tried other characters as well and it always fails; the only exception is an ASCII-only input string, which converts fine. What am I doing wrong?
UTF-8 is not just any multibyte encoding. A multibyte string uses some locale-dependent codepage to represent characters, and some characters take more than one byte; mbsnrtowcs interprets its input according to the current locale's codepage, which is not necessarily UTF-8.
Since you are combining ASCII characters and non-ASCII characters, UTF-8 is the right encoding for the input.
But converting that UTF-8 input to wchar_t (which is UTF-16 on Windows and UTF-32 on Linux) with mbsnrtowcs only works when the locale's multibyte encoding actually is UTF-8.
If you stick with UTF-8 you should look into a Unicode handling library. For most tasks I recommend UTF8-CPP from http://utfcpp.sourceforge.net/
You can read more about Unicode and UTF-8 on Wikipedia.
MultiByteToWideChar has a parameter where you specify the code page, but mbsnrtowcs doesn't. On Linux, have you set LC_CTYPE in your locale to specify UTF-8?
SOLUTION: By default every C program starts in the "C" locale, so I had to call setlocale(LC_CTYPE, ""). The empty string means the environment's locale is used, in my case en_US.utf8, and then the conversion worked.
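For reference, here is a minimal sketch of that fix. It uses the standard mbsrtowcs (the POSIX mbsnrtowcs from the question only adds a byte-count limit) and assumes the environment locale is a UTF-8 one such as en_US.utf8:
#include <clocale>
#include <cstddef>
#include <cstdio>
#include <cwchar>

int main()
{
    std::setlocale(LC_CTYPE, "");   // "" = pick up the environment's locale; it must be UTF-8 here

    const char *input = "A quick brown Fox jumps \xC2\xA9 over the lazy Dog.";
    const char *src = input;
    wchar_t output[128];
    std::mbstate_t state{};

    std::size_t n = std::mbsrtowcs(output, &src, 128, &state);
    if (n == static_cast<std::size_t>(-1)) {
        std::perror("mbsrtowcs");   // fails with EILSEQ when the locale cannot decode the bytes
        return 1;
    }
    std::wprintf(L"%ls\n", output);
    return 0;
}
Without the setlocale() call the program stays in the "C" locale and the conversion fails exactly as described in the question.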

Use regular expression to match ANY Chinese character in utf-8 encoding

For example, I want to match a string consisting of m to n Chinese characters, then I can use:
[single Chinese character regular expression]{m,n}
Is there some regular expression of a single Chinese character, which could be any Chinese characters that exists?
The regex to match a Chinese (well, CJK) character is
\p{script=Han}
which can be abbreviated to simply
\p{Han}
This assumes that your regex compiler meets requirement RL1.2 Properties from UTS#18 Unicode Regular Expressions. Perl and Java 7 both meet that spec, but many others do not.
In Java,
\p{InCJK_UNIFIED_IDEOGRAPHS}{1,3}
In C#
new Regex(@"\p{IsCJKUnifiedIdeographs}")
Here it is in the Microsoft docs
And here's more info from Wikipedia: CJK Unified Ideographs
The basic block named CJK Unified Ideographs (4E00–9FFF) contains 20,976 basic Chinese characters in the range U+4E00 through U+9FEF. The block not only includes characters used in the Chinese writing system but also kanji used in the Japanese writing system and hanja, whose use is diminishing in Korea. Many characters in this block are used in all three writing systems, while others are in only one or two of the three. Chinese characters are also used in Vietnam's Nôm script (now obsolete).
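If your text is already decoded to code points and the basic block quoted above is all you need, a plain range test also works. This is only a sketch, not a regex, and it deliberately ignores the extension blocks (such as U+3400–U+4DBF and U+20000 onwards):
#include <iostream>

// True for code points in the basic CJK Unified Ideographs block (U+4E00..U+9FFF).
bool isCjkUnifiedIdeograph(char32_t cp)
{
    return cp >= 0x4E00 && cp <= 0x9FFF;
}

int main()
{
    std::cout << isCjkUnifiedIdeograph(U'世') << '\n';  // 1 (U+4E16 is in the block)
    std::cout << isCjkUnifiedIdeograph(U'A') << '\n';   // 0
    return 0;
}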
Is there some regular expression of a single Chinese character, which could be any Chinese characters that exists?
Recommendation
To match patterns with Chinese characters and other Unicode code points with a Flex-compatible lexical analyzer, you could use the RE/flex lexical analyzer for C++ that is backwards compatible with Flex. RE/flex supports Unicode and works with Bison to build lexers and parsers.
You can write Unicode patterns (and UTF-8 regular expressions) in RE/flex specifications such as:
%option flex unicode
%%
[肖晗] { printf ("xiaohan/2\n"); }
%%
Use global %option unicode to enable Unicode. You can also use a local modifier (?u:) to restrict Unicode to a single pattern (so everything else is still ASCII/8-bit as in Flex):
%option flex
%%
(?u:[肖晗]) { printf ("xiaohan/2\n"); }
(?u:\p{Han}) { printf ("Han character %s\n", yytext); }
. { printf ("8-bit character %d\n", yytext[0]); }
%%
Option flex enables Flex compatibility, so you can use yytext, yyleng, ECHO, and so on. Without the flex option RE/flex expects Lexer method calls: text() (or str() and wstr() for std::string and std::wstring), size() (or wsize() for wide char length), and echo(). RE/flex method calls are cleaner IMHO, and include wide char operations.
Background
In plain old Flex I ended up defining ugly UTF-8 patterns to capture ASCII letters and UTF-8 encoded letters for a compiler project that required support for Unicode identifiers (id):
digit [0-9]
alpha ([a-zA-Z_\xA8\xAA\xAD\xAF\xB2\xB5\xB7\xB8\xB9\xBA\xBC\xBD\xBE]|[\xC0-\xFF][\x80-\xBF]*|\\u([0-9a-fA-F]{4}))
id ({alpha})({alpha}|{digit})*
The alpha pattern supports ASCII letters, underscore, and the Unicode code points that are used in identifiers (\p{L} etc.). To keep the size of the pattern manageable, it permits more Unicode code points than strictly necessary, so it trades accuracy for compactness and even accepts some overlong sequences that are not valid UTF-8. If you are thinking about this approach, then be wary of the problems and safety concerns. Use a Unicode-capable scanner generator instead, such as RE/flex.
Safety
When using UTF-8 directly in Flex patterns, there are several concerns:
Encoding your own UTF-8 patterns in Flex for matching any Unicode character is prone to errors. Patterns should be restricted to characters in the valid Unicode range only. Unicode code points cover the ranges U+0000 to U+D7FF and U+E000 to U+10FFFF; the range U+D800 to U+DFFF is reserved for UTF-16 surrogates and contains no valid code points. When using a tool to convert a Unicode range to UTF-8, make sure to exclude these invalid code points.
Patterns should reject overlong and other invalid byte sequences. Invalid UTF-8 should not be silently accepted.
Catching lexical input errors in your lexer requires a special . (dot) rule that matches both valid and invalid Unicode, including UTF-8 overruns and invalid byte sequences, so that you can produce an error message saying the input was rejected. If you use dot as a "catch-all-else" to produce an error message, but your dot does not match invalid Unicode, then your lexer will either hang ("scanner is jammed") or ECHO rubbish characters to the output via the Flex "default rule".
Your scanner should recognize a UTF BOM (Unicode Byte Order Mark) in the input to switch to UTF-8, UTF-16 (LE or BE), or UTF-32 (LE or BE).
As you point out, patterns such as [unicode characters] do not work at all with Flex, because the UTF-8 characters in a bracket list are multi-byte sequences: Flex matches the individual bytes, not the UTF-8 character as a whole.
See also invalid UTF encodings in the RE/flex user guide.
In Java 7 and up, the format should be: "\p{IsHan}"
I just solved a similar problem. When there is too much to match, it is better to use a negated set and declare what you don't want to match, for example everything except digits:
^[^0-9]*$
The second ^ implements the negation.
In Go, it can be done like this:
package main

import (
    "fmt"
    "regexp"
)

func main() {
    compile, err := regexp.Compile("\\p{Han}") // matches any single Chinese (Han) character
    if err != nil {
        return
    }
    str := compile.FindString("hello 世界")
    fmt.Println(str) // output: 世
}

How to get the Unicode code points of characters in a UTF-8 string in C or C++ (Linux)

I am working on an application in which I need to know the Unicode code point of each character so that I can classify it as Chinese, Japanese (kanji, katakana, hiragana), Latin, Greek, etc.
The given string is in UTF-8 format.
Is there a way to get the Unicode code point of a UTF-8 character? For example:
Character '≠' has the Unicode value U+2260.
Character '建' has the Unicode value U+5EFA.
UTF-8 is a variable-width encoding of Unicode. Each Unicode code point is encoded in one to four chars (bytes).
To decode a char* string and extract a single code point, read one byte. If its most significant bit is clear, that byte is the code point. Otherwise the code point is encoded in multiple bytes, and the number of leading 1 bits in the first byte tells you how many bytes encode it.
This table shows how the conversion works:
UTF-8 (char*)                       | Unicode code point (21 bits)
------------------------------------+-----------------------------
0xxxxxxx                            | 00000000000000xxxxxxx
------------------------------------+-----------------------------
110yyyyy 10xxxxxx                   | 0000000000yyyyyxxxxxx
------------------------------------+-----------------------------
1110zzzz 10yyyyyy 10xxxxxx          | 00000zzzzyyyyyyxxxxxx
------------------------------------+-----------------------------
11110www 10zzzzzz 10yyyyyy 10xxxxxx | wwwzzzzzzyyyyyyxxxxxx
Based on that, the code is relatively straightforward to write. If you don't want to write it yourself, you can use a library that does the conversion for you. There are many available on Linux: libiconv, ICU, GLib, ...
libiconv can help you convert the UTF-8 string to UTF-16 or UTF-32. UTF-32 would be the safest option if you really want to support every possible Unicode code point.
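If you do want to write the decoder yourself, here is a minimal sketch that follows the table above. It assumes well-formed input: it does not validate continuation bytes, reject overlong encodings, or exclude the U+D800-U+DFFF surrogate range, all of which a production decoder should do.
#include <cstdio>

// Decode one code point starting at s and advance s past it (no validation).
char32_t decodeUtf8(const unsigned char *&s)
{
    char32_t cp;
    int extra;

    if (*s < 0x80)      { cp = *s;        extra = 0; }  // 0xxxxxxx
    else if (*s < 0xE0) { cp = *s & 0x1F; extra = 1; }  // 110yyyyy
    else if (*s < 0xF0) { cp = *s & 0x0F; extra = 2; }  // 1110zzzz
    else                { cp = *s & 0x07; extra = 3; }  // 11110www
    ++s;

    while (extra-- > 0) {
        cp = (cp << 6) | (*s & 0x3F);  // append the low 6 bits of each 10xxxxxx byte
        ++s;
    }
    return cp;
}

int main()
{
    // "≠建" from the question, written out as UTF-8 bytes (U+2260, U+5EFA).
    const unsigned char text[] = { 0xE2, 0x89, 0xA0, 0xE5, 0xBB, 0xBA, 0x00 };

    for (const unsigned char *p = text; *p != 0; )
        std::printf("U+%04X\n", static_cast<unsigned>(decodeUtf8(p)));

    return 0;
}
This prints U+2260 and U+5EFA, matching the examples in the question.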