How to drop control characters when using WideCharToMultibyte - c++

I'm working on a Windows UI Automation client that interfaces with some legacy code. At the point where the legacy code is called I have to convert the LPWSTR to a char *, which works in most cases, but sometimes the input strings contain control characters (such as the invisible LTR control character), and WideCharToMultibyte always maps those characters to '?'.
Is it possible to drop those characters? Is there another function better suited to this purpose? Any help would be greatly appreciated!

Doesn't look like there's a single function that accomplishes this, so I'm using Mark Random's solution from the comments.
Let lpDefaultChar point to a character that will never be in your string, such as 0x01, then remove those characters from the output. – Mark Ransom

Related

Right to left isolate string. C++

does anyone have experience in Unicodes?
I am facing a tough problem with Farsi unicodes.
I have an std::wstring s = (L"\u0634\u0646\u0628\u0647"); which is a Farsi word. When I debug it, I see that the underlying word is exactly what I want, but reversed. So I have researched and found that u2067 is for right to left reading the string.
NOTE:
I cannot reverse the string manually because Farsi characters are changing their shape regardless of their position in the string.
So I added the 2067 int the beginning and got
std::wstring s = (L"\u2067\u0634\u0646\u0628\u0647");.
But now the underlying string is the same , just added a square in the beginning if the string instead of reversing.
Does anyone have experince with this stuff? Please suggest a solution. Thanks!
The underlying string will be the same. You haven't changed the order of bytes, which is written right there in the code. But a renderer that understands Unicode should take those bytes and display the characters right-to-left. That's a visual thing. It has nothing to do with the encoding. From your question, it's not entirely clear what else you expected. It may be that you are viewing the string in a debugger, and the debugger does not support this feature of Unicode. If you try outputting the string to a proper console you ought to see it as you expect.

How to insert check mark "✓" in std::string in c++ programming using VS2010

In one of my application I'm developing in c++, I have to display "✓" mark. For that I need to first insert the same in a std::string or in a char. But when I do that, I'm getting a "?" mark as output. I'm using VS2010 to code. Please suggest how to solve the same. Thanks in advance.
There seems to be some basic misunderstanding.
The checkmark character is Unicode 0x2713. You cannot store it as a single character in std::string. The maximum value for a char is 0xff (255). It won't fit.
If you are developing a GUI using C++ for Windows then I would guess MFC. However if you are using std::string then perhaps that's not so. Some choices:
For MFC, you can rebuild your application in UNICODE mode. Then chars are short (16 bits) and your checkmark will fit fine.
You could use std::wstring instead of string. That means changes to existing code.
You could use UTF-8, which replaces the character by a multi-byte sequence. Not recommended in Windows, even if you think you know what you're doing. Very unfriendly.
In any case, if you are using GUI and dialog boxes you will have to make sure they are Unicode dialog boxes or nothing will work.
With a few more details, we could give more specific advice.
you can insert check mark in your console using C++ using the following code
cout << " (\xfb) "<<endl;
Output:
(√)

How can non-ASCII characters be detected in a QString?

I want to detect if the user has inputted a non-ASCII (otherwise incorrectly known as Unicode) character (for example, り) in a file save dialog box. As I am using Qt, any non-ASCII characters are properly saved in a QString, but I can't figure out how to determine if any of the characters in that string are non-ASCII before converting the string to ASCII. That character above ends up getting written to the filesystem as ã‚Š.
There is no such a built-in feature in my understanding.
About 1-2 years ago, I was proposing an isAscii() method for QString/QChar to wrap the low-level Unix isacii() and the corresponding Windows function, but it was rejected. You could have written then something like this:
bool isUnicode = !myString.at(3).isAcii();
I still think this would be a handy feature if you can convince the maintainer. :-)
Other than that, you would need to check against the ascii boundary yourself, I am afraid. You can do this yourself as follows:
bool isUnicode = myChar.unicode() > 127;
See the documentation for details:
ushort QChar::unicode () const
This is an overloaded function.
The simplest way is to check every charachter's code (QChar::unicode()) to be below 128 if you need pure 7-bit ASCII.
To write it in compact way without loop, you can use regular expression:
bool containsNonASCII = myString.contains(QRegularExpression(QStringLiteral("[^\\x{0000}-\\x{007F}]")));
this works for me :
isLetterOrNumber()
ot_id += QChar((short) b.to_ulong()).isLetterOrNumber() ? QChar((short) b.to_ulong()) : QString("");

How can I recognize RTL strings in C++

I need to know the direction of my text before printing.
I'm using Unicode Characters.
How can I do that in C++?
If you don't want to use ICU, you can always manually parse the unicode database (.e.g., with a python script). It's a semicolon-separated text file, with each line representing a character code point. Look for the fifth record in each line - that's the character class. If it's R or AL, you have an RTL character, and 'L' is an LTR character. Other classes are weak or neutral types (like numerals), which I guess you'd want to ignore. Using that info, you can generate a lookup table of all RTL characters and then use it in your C++ code. If you really care about code size, you can minimize the size the lookup table takes in your code by using ranges (instead of an entry for each character), since most characters come in blocks of their BiDi class.
Now, define a function called GetCharDirection(wchar_t ch) which returns an enum value (say: Dir_LTR, Dir_RTL or Dir_Neutral) by checking the lookup table.
Now you can define a function GetStringDirection(const wchar_t*) which runs through all characters in the string until it encounters a character which is not Dir_Neutral. This first non-neutral character in the string should set the base direction for that string. Or at least that's how ICU seems to work.
You could use the ICU library, which has a functions for that (ubidi_getDirection ubidi_getBaseDirection).
The size of ICU can be reduced, by recompiling the data library (which is normally about 15MB big), to include only the converters/locals which are needed for the project.
The section Reducing the Size of ICU's Data: Conversion Tables of the site http://userguide.icu-project.org/icudata, contains information how you can reduce the size of the data library.
If only need support for the most common encodings (US-ASCII, ISO-8859-1, UTF-7/8/16/32, SCSU, BOCU-1, CESU-8), the data library wont be needed anyway.
From Boaz Yaniv said before, maybe something like this will easier and faster than parsing the whole file:
int aft_isrtl(int c){
if (
(c==0x05BE)||(c==0x05C0)||(c==0x05C3)||(c==0x05C6)||
((c>=0x05D0)&&(c<=0x05F4))||
(c==0x0608)||(c==0x060B)||(c==0x060D)||
((c>=0x061B)&&(c<=0x064A))||
((c>=0x066D)&&(c<=0x066F))||
((c>=0x0671)&&(c<=0x06D5))||
((c>=0x06E5)&&(c<=0x06E6))||
((c>=0x06EE)&&(c<=0x06EF))||
((c>=0x06FA)&&(c<=0x0710))||
((c>=0x0712)&&(c<=0x072F))||
((c>=0x074D)&&(c<=0x07A5))||
((c>=0x07B1)&&(c<=0x07EA))||
((c>=0x07F4)&&(c<=0x07F5))||
((c>=0x07FA)&&(c<=0x0815))||
(c==0x081A)||(c==0x0824)||(c==0x0828)||
((c>=0x0830)&&(c<=0x0858))||
((c>=0x085E)&&(c<=0x08AC))||
(c==0x200F)||(c==0xFB1D)||
((c>=0xFB1F)&&(c<=0xFB28))||
((c>=0xFB2A)&&(c<=0xFD3D))||
((c>=0xFD50)&&(c<=0xFDFC))||
((c>=0xFE70)&&(c<=0xFEFC))||
((c>=0x10800)&&(c<=0x1091B))||
((c>=0x10920)&&(c<=0x10A00))||
((c>=0x10A10)&&(c<=0x10A33))||
((c>=0x10A40)&&(c<=0x10B35))||
((c>=0x10B40)&&(c<=0x10C48))||
((c>=0x1EE00)&&(c<=0x1EEBB))
) return 1;
return 0;
}
If you are using Windows GDI, it would seem that GetFontLanguageInfo(HDC) returns a DWORD; if GCP_REORDER is set, the language requires reordering for display, for example, Hebrew or Arabic.

Using preg_replace/ preg_match with UTF-8 characters - specifically Māori macrons

I'm writing some autosuggest functionality which suggests page names that relate to the terms entered in the search box on our website.
For example typing in "rubbish" would suggest "Rubbish & Recycling", "Rubbish Collection Centres" etc.
I am running into a problem that some of our page names include macrons - specifically the macron used to correctly spell "Māori" (the indigenous people of New Zealand).
Users are going to type "maori" into the search box and I want to be able to return pages such as "Māori History".
The autosuggestion is sourced from a cached array built from all the pages and keywords. To try and locate Māori I've been trying various regex expressions like:
preg_match('/\m(.{1})ori/i',$page_title)
Which also returns page titles containing "Moorings" but not "Māori". How does preg_match/ preg_replace see characters like "ā" and how should I construct the regex to pick them up?
Cheers
Tama
Use the /u modifier for utf-8 mode in regexes,
You're better of on a whole with doing an iconv('utf-8','ascii//TRANSLIT',$string) on both name & search and comparing those.
One thing you need to remember is that UTF-8 gives you multi-byte characters for anything outside of ASCII. I don't know if the string $page_title is being treated as a Unicode object or a dumb byte string. If it's the byte string option, you're going to have to do double dots there to catch it instead, or {1,4}. And even then you're going to have to verify the up to four bytes you grab between the M and the o form a singular valid UTF-8 character. This is all moot if PHP does unicode right, I haven't used it in years so I can't vouch for it.
The other issue to consider is that ā can be constructed in two ways; one as a single character (U+0101) and one as TWO unicode characters ('a' plus a combining diacritic in the U+0300 range). You're likely just only going to ever get the former, but be aware that the latter is also possible.
The only language I know of that does this stuff reliably well is Perl 6, which has all kinds on insane modifiers for internationalized text in regexps.