How can non-ASCII characters be detected in a QString? - c++

I want to detect if the user has inputted a non-ASCII (otherwise incorrectly known as Unicode) character (for example, り) in a file save dialog box. As I am using Qt, any non-ASCII characters are properly saved in a QString, but I can't figure out how to determine if any of the characters in that string are non-ASCII before converting the string to ASCII. That character above ends up getting written to the filesystem as ã‚Š.

As far as I know, there is no built-in feature for this.
About 1-2 years ago, I proposed an isAscii() method for QString/QChar to wrap the low-level Unix isascii() and the corresponding Windows function, but it was rejected. Had it been accepted, you could have written something like this:
bool isUnicode = !myString.at(3).isAscii();
I still think this would be a handy feature if you can convince the maintainer. :-)
Other than that, I am afraid you need to check against the ASCII boundary yourself, as follows:
bool isUnicode = myChar.unicode() > 127;
See the documentation for details:
ushort QChar::unicode () const
This is an overloaded function.

The simplest way is to check that every character's code (QChar::unicode()) is below 128, if you need pure 7-bit ASCII.
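A minimal sketch of that per-character check, written in plain C++ so it can be compiled without Qt. Here std::u16string stands in for QString, and containsNonAscii is a hypothetical helper name; with Qt the same comparison would presumably be applied to QChar::unicode() for each character of the string.

```cpp
#include <algorithm>
#include <string>

// Returns true if any UTF-16 code unit lies outside 7-bit ASCII (0..127).
// std::u16string is a stand-in for QString in this sketch.
bool containsNonAscii(const std::u16string& text) {
    return std::any_of(text.begin(), text.end(),
                       [](char16_t unit) { return unit > 127; });
}
```

Because every non-ASCII character has at least one UTF-16 code unit above 127, checking code units one by one is enough; surrogate pairs need no special handling for this test.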

To write it compactly without a loop, you can use a regular expression:
bool containsNonASCII = myString.contains(QRegularExpression(QStringLiteral("[^\\x{0000}-\\x{007F}]")));

This works for me:
isLetterOrNumber()
ot_id += QChar((short) b.to_ulong()).isLetterOrNumber() ? QChar((short) b.to_ulong()) : QString("");

Related

Convert \xc3\xd8\xe8\xa7\xc3\xb4\xd to human readable format

I am having trouble converting '\xc3\xd8\xe8\xa7\xc3\xb4\xd' (which is Thai text) to a readable format. I get this value from a smart card; this worked on Windows but does not work on Linux.
If I print in my Python console, I get:
����ô
I tried to follow some hints found via Google, but I was unable to accomplish my goal.
Any suggestion is appreciated.
Your text does not seem to be Unicode text. Instead, it looks like it is in one of the Thai encodings, so you must know the encoding before printing the text.
For example, if we assume your data is encoded in TIS-620 (and the last character is \xd2 instead of \xd), then it will be "รุ่งรดา".
To work with non-Unicode strings in Python, you may try myString.decode("tis-620") or even sys.setdefaultencoding("tis-620") (the latter exists only in Python 2 and is generally discouraged).
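To show what such a decode step does, here is a rough sketch in C++ of a TIS-620 decoder. It relies on the fact that TIS-620 bytes 0xA1-0xDA and 0xDF-0xFB map to consecutive Thai code points starting at U+0E01; decodeTis620 is a hypothetical name, and replacing unassigned bytes with U+FFFD is a choice made for this example.

```cpp
#include <string>

// Decodes TIS-620 bytes into Unicode code points.
// Bytes 0x00-0x7F are plain ASCII; 0xA1-0xDA and 0xDF-0xFB map to
// U+0E01 + (byte - 0xA1). Every other byte is unassigned and is
// replaced with U+FFFD (an assumption of this sketch).
std::u32string decodeTis620(const std::string& bytes) {
    std::u32string out;
    for (unsigned char b : bytes) {
        if (b < 0x80) {
            out.push_back(static_cast<char32_t>(b));
        } else if ((b >= 0xA1 && b <= 0xDA) || (b >= 0xDF && b <= 0xFB)) {
            out.push_back(static_cast<char32_t>(0x0E01 + (b - 0xA1)));
        } else {
            out.push_back(static_cast<char32_t>(0xFFFD));
        }
    }
    return out;
}
```

With the corrected last byte (\xd2), the input from the question decodes to the seven code points of "รุ่งรดา".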

How to drop control characters when using WideCharToMultiByte

I'm working on a Windows UI Automation client that interfaces with some legacy code. At the point where the legacy code is called I have to convert the LPWSTR to a char *, which works in most cases, but sometimes the input strings contain control characters (such as the invisible LTR control character), and WideCharToMultiByte always maps those characters to '?'.
Is it possible to drop those characters? Is there another function better suited to this purpose? Any help would be greatly appreciated!
It doesn't look like there's a single function that accomplishes this, so I'm using Mark Ransom's solution from the comments.
Let lpDefaultChar point to a character that will never be in your string, such as 0x01, then remove those characters from the output. – Mark Ransom
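The clean-up half of that suggestion can be sketched in portable C++. This assumes the conversion has already been done with WideCharToMultiByte and that lpDefaultChar pointed at the sentinel byte 0x01; dropSentinel is a hypothetical helper name, and the Windows call itself is not shown.

```cpp
#include <algorithm>
#include <string>

// Removes every occurrence of the sentinel byte from an already converted
// narrow string. The sentinel is assumed to be the value that was passed
// via lpDefaultChar, so each unmappable character became exactly one
// sentinel byte in the output.
std::string dropSentinel(std::string converted, char sentinel = '\x01') {
    converted.erase(std::remove(converted.begin(), converted.end(), sentinel),
                    converted.end());
    return converted;
}
```

This only works if the sentinel can never occur in legitimate input, which is why a rarely used control code such as 0x01 is suggested.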

How to display extended ASCII characters in QTextEdit

All the ASCII codes greater than 127 are replaced by the diamond-with-question-mark symbol (�). How can I display those characters? I have an unsigned char buffer[1024] which contains values from 0 to 255.
Use the QString class's fromAscii() method. By default this treats chars above 127 as Latin-1 chars. To change this behavior, use the QTextCodec::setCodecForCStrings method to set the correct codec for your usage.
I believe Qt 5 may have taken out the setCodecForCStrings method.
EDIT: Adnan supplied the Qt 5 alternative to the setCodecForCStrings method; adding it to the answer for completeness.
The Qt 5 alternative for setCodecForCStrings is QTextCodec::setCodecForLocale(QTextCodec::codecForName("UTF-8"));
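For illustration, this is roughly what treating the buffer as Latin-1 means, sketched in plain C++ without Qt. In Latin-1 every byte value equals its Unicode code point, so bytes above 127 simply become two-byte UTF-8 sequences. latin1ToUtf8 is a hypothetical helper, not a Qt function.

```cpp
#include <cstddef>
#include <string>

// Converts a Latin-1 (ISO-8859-1) byte buffer to a UTF-8 string.
// Each Latin-1 byte value is also its Unicode code point, so values
// 0x80..0xFF are encoded as two UTF-8 bytes.
std::string latin1ToUtf8(const unsigned char* data, std::size_t length) {
    std::string out;
    for (std::size_t i = 0; i < length; ++i) {
        const unsigned char b = data[i];
        if (b < 0x80) {
            out.push_back(static_cast<char>(b));
        } else {
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return out;
}
```

If the buffer is really in another code page, this mapping will show the wrong glyphs, which is why choosing the correct codec matters.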
This is a rabbit hole with no end. Qt does not fully support printing ASCII > 127, as it is not well defined. The current method is to use fromLocal8Bit(), which takes a char array and transforms it into the "right" Unicode string (the only thing Qt supports printing).
QTextCodec::setCodecForLocale can be used to identify the character set you wish to transform from. Many codecs are supported, but for some reason IBM437 (the character set used by IBM PCs in the US for decades) is not, whereas several other codecs used in Europe and elsewhere are. Probably some characters in IBM437 were never assigned proper code points in Unicode, so transforming it isn't possible?
What's frustrating is that there are fonts with all 256 ASCII code points, but it is simply not possible to display these in Qt, as it only works with Unicode strings. There are a handful of glyphs it doesn't support, and the list seems to grow with newer versions of Qt. Currently I know of 9, 10, 12, 13, and 173. Some of these are unsupported for obvious reasons (usually you don't want to print a carriage return glyph, though it did exist in DOS), but others used to work in Qt and now do not.
In my application, I resorted to creating a new font that has copies of the unprintable glyphs at higher Unicode code points, and I translate the characters before printing them on the screen. It's quite silly, but Qt gave up on ASCII many years ago, so it's the best option I could find.
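A translation step of that kind could look roughly like the sketch below. Everything specific here is an assumption: the answer does not say which code points the copied glyphs were placed at, so this example uses the Private Use Area starting at U+E000, and the names kRemapBase, needsRemap and remapForDisplay are hypothetical. All other bytes are passed through as code points of the same value, which is a simplification.

```cpp
#include <string>

// Hypothetical location of the copied glyphs in the custom font:
// the start of the Unicode Private Use Area.
constexpr char32_t kRemapBase = 0xE000;

// Byte values the answer lists as not displayable (9, 10, 12, 13, 173).
bool needsRemap(unsigned char b) {
    return b == 9 || b == 10 || b == 12 || b == 13 || b == 173;
}

// Moves the problematic bytes to kRemapBase + byte; every other byte is
// passed through as the code point with the same value.
std::u32string remapForDisplay(const std::string& bytes) {
    std::u32string out;
    for (unsigned char b : bytes) {
        out.push_back(needsRemap(b) ? static_cast<char32_t>(kRemapBase + b)
                                    : static_cast<char32_t>(b));
    }
    return out;
}
```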

How To Remove "Ctrl + Backspace" Special Character?

I have a server written in C++, and when receiving a chat string, I'd like to remove weird special characters like the one created by Ctrl + Backspace (though not other symbols like :)]>_ etc.)
I'm using Boost, too.
edit: Why'd this get -1'd? It's a legit question.
Sounds like isprint might help. It returns true for any printable character, i.e. not for control characters such as tab, newline, or DEL (the space character itself counts as printable). For a list of what is considered printable and what not, take a look at this table.
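A small sketch of the isprint approach, assuming the chat string is a std::string of single-byte characters; keepPrintable is a hypothetical name. Note that in the default "C" locale isprint is also false for every byte above 127, so this filter would strip non-ASCII text such as UTF-8 as well.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Returns a copy of the text with every non-printable byte removed.
// The lambda takes unsigned char because passing a negative char value
// to std::isprint is undefined behavior.
std::string keepPrintable(std::string text) {
    text.erase(std::remove_if(text.begin(), text.end(),
                              [](unsigned char c) { return !std::isprint(c); }),
               text.end());
    return text;
}
```

Ordinary symbols like :)]>_ are printable and therefore survive the filter.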
I haven't used it, and this probably isn't the best way to do it, but have you considered trying the boost regex library (i.e., regex_replace)?
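The regex idea can be sketched with std::regex from the standard library, used here as a stand-in for Boost.Regex so the example has no external dependency. The function name stripControlChars is hypothetical, and the [[:cntrl:]] character class is one possible choice of pattern.

```cpp
#include <regex>
#include <string>

// Replaces every control character (as classified by the current locale)
// with nothing, leaving letters, digits, punctuation and spaces intact.
std::string stripControlChars(const std::string& text) {
    static const std::regex controlChars("[[:cntrl:]]");
    return std::regex_replace(text, controlChars, "");
}
```

For a single pass over short chat strings, a regex is likely heavier than a plain character filter, so this is mostly useful if a regex library is already in use.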

How can I recognize RTL strings in C++

I need to know the direction of my text before printing.
I'm using Unicode Characters.
How can I do that in C++?
If you don't want to use ICU, you can always parse the Unicode database manually (e.g., with a Python script). It's a semicolon-separated text file, with each line representing a character code point. Look for the fifth field in each line: that's the character's bidirectional class. If it's R or AL, you have an RTL character; L means an LTR character. Other classes are weak or neutral types (like numerals), which I guess you'd want to ignore. Using that info, you can generate a lookup table of all RTL characters and then use it in your C++ code. If you really care about code size, you can minimize the space the lookup table takes in your code by using ranges (instead of an entry for each character), since most characters come in blocks of the same BiDi class.
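As an illustration of the parsing step, here is a short C++ sketch that extracts the fifth field from one line of the database file (UnicodeData.txt). The helper names bidiClassOfLine and isRtlClass are hypothetical, and reading the file and building the table are left out.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Splits one UnicodeData.txt line on ';' and returns the fifth field
// (index 4), which holds the bidirectional class such as "L", "R" or "AL".
// Returns an empty string if the line has fewer than five fields.
std::string bidiClassOfLine(const std::string& line) {
    std::vector<std::string> fields;
    std::istringstream stream(line);
    std::string field;
    while (std::getline(stream, field, ';')) {
        fields.push_back(field);
    }
    return fields.size() > 4 ? fields[4] : std::string();
}

// True for the two classes the text above treats as right-to-left.
bool isRtlClass(const std::string& bidiClass) {
    return bidiClass == "R" || bidiClass == "AL";
}
```

For example, the line for U+05D0 reads 05D0;HEBREW LETTER ALEF;Lo;0;R;;;;;N;;;;; and its fifth field is R.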
Now, define a function called GetCharDirection(wchar_t ch) which returns an enum value (say: Dir_LTR, Dir_RTL or Dir_Neutral) by checking the lookup table.
Now you can define a function GetStringDirection(const wchar_t*) which runs through all characters in the string until it encounters a character which is not Dir_Neutral. This first non-neutral character in the string should set the base direction for that string. Or at least that's how ICU seems to work.
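The two functions described above might be sketched like this. The range table is only a tiny illustrative subset (basic Latin letters, Hebrew letters, Arabic letters), not a generated table, and the sketch uses char32_t and std::u32string instead of wchar_t so that code points above U+FFFF would fit; both choices are assumptions of this example.

```cpp
#include <string>

enum Direction { Dir_LTR, Dir_RTL, Dir_Neutral };

struct DirRange {
    char32_t first;
    char32_t last;
    Direction dir;
};

// Illustrative subset only; a real table would be generated from the
// Unicode database.
static const DirRange kRanges[] = {
    {0x0041, 0x005A, Dir_LTR},  // A-Z
    {0x0061, 0x007A, Dir_LTR},  // a-z
    {0x05D0, 0x05EA, Dir_RTL},  // Hebrew letters
    {0x0627, 0x064A, Dir_RTL},  // Arabic letters
};

// Looks the character up in the range table; anything not listed is
// treated as neutral.
Direction GetCharDirection(char32_t ch) {
    for (const DirRange& range : kRanges) {
        if (ch >= range.first && ch <= range.last) {
            return range.dir;
        }
    }
    return Dir_Neutral;
}

// The first non-neutral character decides the base direction of the string.
Direction GetStringDirection(const std::u32string& text) {
    for (char32_t ch : text) {
        const Direction dir = GetCharDirection(ch);
        if (dir != Dir_Neutral) {
            return dir;
        }
    }
    return Dir_Neutral;
}
```

A string such as "123 שלום" is reported as RTL because the digits and the space are neutral in this table and the first strong character is Hebrew.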
You could use the ICU library, which has functions for that (ubidi_getDirection, ubidi_getBaseDirection).
The size of ICU can be reduced by recompiling the data library (which is normally about 15MB) to include only the converters/locales which are needed for the project.
The section "Reducing the Size of ICU's Data: Conversion Tables" of the site http://userguide.icu-project.org/icudata contains information on how you can reduce the size of the data library.
If you only need support for the most common encodings (US-ASCII, ISO-8859-1, UTF-7/8/16/32, SCSU, BOCU-1, CESU-8), the data library won't be needed anyway.
Building on what Boaz Yaniv said before, maybe something like this will be easier and faster than parsing the whole file:
int aft_isrtl(int c){
    if (
        (c==0x05BE)||(c==0x05C0)||(c==0x05C3)||(c==0x05C6)||
        ((c>=0x05D0)&&(c<=0x05F4))||
        (c==0x0608)||(c==0x060B)||(c==0x060D)||
        ((c>=0x061B)&&(c<=0x064A))||
        ((c>=0x066D)&&(c<=0x066F))||
        ((c>=0x0671)&&(c<=0x06D5))||
        ((c>=0x06E5)&&(c<=0x06E6))||
        ((c>=0x06EE)&&(c<=0x06EF))||
        ((c>=0x06FA)&&(c<=0x0710))||
        ((c>=0x0712)&&(c<=0x072F))||
        ((c>=0x074D)&&(c<=0x07A5))||
        ((c>=0x07B1)&&(c<=0x07EA))||
        ((c>=0x07F4)&&(c<=0x07F5))||
        ((c>=0x07FA)&&(c<=0x0815))||
        (c==0x081A)||(c==0x0824)||(c==0x0828)||
        ((c>=0x0830)&&(c<=0x0858))||
        ((c>=0x085E)&&(c<=0x08AC))||
        (c==0x200F)||(c==0xFB1D)||
        ((c>=0xFB1F)&&(c<=0xFB28))||
        ((c>=0xFB2A)&&(c<=0xFD3D))||
        ((c>=0xFD50)&&(c<=0xFDFC))||
        ((c>=0xFE70)&&(c<=0xFEFC))||
        ((c>=0x10800)&&(c<=0x1091B))||
        ((c>=0x10920)&&(c<=0x10A00))||
        ((c>=0x10A10)&&(c<=0x10A33))||
        ((c>=0x10A40)&&(c<=0x10B35))||
        ((c>=0x10B40)&&(c<=0x10C48))||
        ((c>=0x1EE00)&&(c<=0x1EEBB))
    ) return 1;
    return 0;
}
If you are using Windows GDI, it would seem that GetFontLanguageInfo(HDC) returns a DWORD; if GCP_REORDER is set, the language requires reordering for display, for example, Hebrew or Arabic.