I'm using a wxWebView to retrieve the displayed contents of a webpage (using GetPageText( )). This returns me a wxString.
This wxString contains a lot of text, including names such as D\u00e1vid K\u00e1m\u00e1n (instead of Dávid Kámán).
I then process the string, split it into its various components, and display some of the values (& names) in a wxGrid.
What I am looking for is a way to convert the wxString containing the escape coded characters into a wxString that will contain the actual characters so that I can display the correct output in the wxGrid.
I can't see any obvious method available though...
One obvious method could be to use regular expressions to find these places, and replace them with the actual letter. Do it in a loop until no matches are found.
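The regular-expression idea can be sketched as follows. This is an illustration in Python rather than wxWidgets C++, and the function name decode_u_escapes is made up for the example. It assumes the escapes are always a backslash, a "u" and exactly four hex digits; escapes that encode surrogate pairs are not handled. A single substitution pass with a callback is enough, so no loop is needed.

```python
import re

# Matches a literal backslash, "u", and four hex digits, e.g. \u00e1
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def decode_u_escapes(text):
    """Replace each literal \\uXXXX sequence with the character it names."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

# Raw string: the backslashes are real characters, as in the page text.
print(decode_u_escapes(r"D\u00e1vid K\u00e1m\u00e1n"))  # Dávid Kámán
```

The same approach (find each match, convert the hex digits to a code point, append the character) should carry over to whatever regex facility is used on the C++ side.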
I have a CComboBox containing strings of serial numbers. I want to be able to bold individual characters in a string. For instance, I would like the second string, 87650123, to show up with only some of its digits in bold.
I have seen some posts about bolding a whole individual string entry in a CCombobox, but not individual characters. Would this be possible? Thanks in advance.
I have attempted bolding the entire entry to start, which was successful, but have not been able to do individual characters.
What is this character
All I really need to know is what is this character. I have not seen anything like this before.
How do I remove this using VB.NET:
data = data.Replace(Chr(???????), "")
Is there a specific decimal code (control character or otherwise) for this character that I can use in place of the ??
Please help.
I tried looking through HTML, ASCII, and regex references to find this character, but I did not find it anywhere.
To prevent possible bugs related to the encoding of your source files, you should use a hex editor (such as this Notepad++ plugin) to find the hexadecimal code of the character, then use that to reference the character in your code:
data = data.Replace(ChrW(&HDB), "")
as opposed to:
data = data.Replace("Û", "")
Note: In this case the hex editor is unnecessary because xDB is already a hex code, but other control characters, such as CR and LF, are not displayed as their hex values [in Notepad++].
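As an alternative to a hex editor, a few lines of script can list the code of every unusual character in a string. The sketch below is in Python purely for illustration (the question itself is about VB.NET), and the helper name describe_chars is made up for the example.

```python
def describe_chars(text):
    """Return (character, hex code) pairs for control and non-ASCII characters."""
    return [
        (ch, f"0x{ord(ch):02X}")
        for ch in text
        if ord(ch) < 0x20 or ord(ch) > 0x7E  # outside printable ASCII
    ]

sample = "abc\u00dbdef\r\n"  # contains Û (0xDB), CR and LF
for ch, code in describe_chars(sample):
    print(repr(ch), code)
```

The reported value can then be used in the replacement call instead of pasting the character itself into the source file.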
I have two versions of the same document (D, say) containing multilingual text (English and others):
I. One is encoded in ASCII with Unicode code points represented as numeric character references (i.e. Unicode characters are of the form &#N;, where N is the decimal equivalent of the Unicode hex value)
II. The other is encoded in UTF-8.
Q 1:
I have a separate list of words (encoded in UTF-8, and in more than one language), that I have to remove from the document D. How should I proceed?
Can I use regex to clean D? For doc type I, I believe I have to spell out the whole &#N; pattern for each word in the list when I form the regex.
Should the task be easier for doc type II, now that I can specify the non-English characters directly in the regex (my emacs is configured to use these non-English fonts) ?
Q 2:
I have a huge collection of such documents. What would be the best algorithm to remove words from each of them? A table look-up is straightforward but probably the slowest. Should I regex through each?
I suggest processing the entities first so that the two sorts of files look the same. When you’re done removing, put the first set back into their encoded form.
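A rough sketch of that round trip, in Python for illustration. The function names are made up for the example, and it assumes only decimal references of the form &#N; occur (hex references such as &#xE9; and named entities are left untouched). All words are compiled into one alternation, so each document is scanned once regardless of the list size.

```python
import re

_NUMERIC_REF = re.compile(r"&#(\d+);")

def decode_refs(doc):
    """Turn decimal references like &#233; into the characters they stand for."""
    return _NUMERIC_REF.sub(lambda m: chr(int(m.group(1))), doc)

def encode_refs(text):
    """Turn every non-ASCII character back into a decimal reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

def build_remover(words):
    """Compile one regex that matches any listed word as a whole word."""
    if not words:
        return lambda text: text
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b")
    return lambda text: pattern.sub("", text)

def clean_type_one(doc, remover):
    """Decode references, remove the words, then restore the ASCII form."""
    return encode_refs(remover(decode_refs(doc)))

remover = build_remover(["caf\u00e9"])  # the word list is plain UTF-8 text
print(clean_type_one("caf&#233; na&#239;ve tea", remover))  # " na&#239;ve tea"
```

For a type II document, the same remover can be applied directly to the decoded UTF-8 text, skipping the two conversion steps.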
I'm writing some autosuggest functionality which suggests page names that relate to the terms entered in the search box on our website.
For example typing in "rubbish" would suggest "Rubbish & Recycling", "Rubbish Collection Centres" etc.
I am running into a problem that some of our page names include macrons - specifically the macron used to correctly spell "Māori" (the indigenous people of New Zealand).
Users are going to type "maori" into the search box and I want to be able to return pages such as "Māori History".
The autosuggestion is sourced from a cached array built from all the pages and keywords. To try and locate Māori I've been trying various regex expressions like:
preg_match('/\m(.{1})ori/i',$page_title)
Which also returns page titles containing "Moorings" but not "Māori". How does preg_match/ preg_replace see characters like "ā" and how should I construct the regex to pick them up?
Cheers
Tama
Use the /u modifier for UTF-8 mode in regexes.
On the whole, you're better off running iconv('utf-8','ascii//TRANSLIT',$string) on both the page name and the search term and comparing the results.
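The "fold both sides, then compare" idea can be sketched like this. The example is in Python rather than PHP and uses Unicode decomposition plus removal of combining marks instead of iconv's TRANSLIT, which is a different technique with a similar effect for accented Latin letters. The names fold and suggest are made up for the example.

```python
import unicodedata

def fold(text):
    """Strip diacritics and case so that 'Māori' and 'maori' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold()

def suggest(query, titles):
    """Return the titles whose folded form contains the folded query."""
    needle = fold(query)
    return [title for title in titles if needle in fold(title)]

titles = ["M\u0101ori History", "Moorings", "Rubbish & Recycling"]
print(suggest("maori", titles))  # ['Māori History']
```

Since the suggestions come from a cached array, the folded form of each title could be stored alongside the original so that only the query needs folding per request.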
One thing you need to remember is that UTF-8 gives you multi-byte characters for anything outside of ASCII. I don't know if the string $page_title is being treated as a Unicode object or a dumb byte string. If it's the byte string option, you're going to have to use two dots there to catch it instead, or {1,4}. And even then you're going to have to verify that the up to four bytes you grab between the M and the o form a single valid UTF-8 character. This is all moot if PHP does Unicode right; I haven't used it in years so I can't vouch for it.
The other issue to consider is that ā can be constructed in two ways: as a single character (U+0101), or as TWO Unicode characters ('a' plus a combining diacritic in the U+0300 range). You're likely only ever going to get the former, but be aware that the latter is also possible.
The only language I know of that does this stuff reliably well is Perl 6, which has all kinds of insane modifiers for internationalized text in regexps.
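Both points can be checked with a few lines. The sketch below is in Python for illustration; it shows that the single-character ā takes two bytes in UTF-8 and that the two spellings only compare equal after normalization. The helper name canonical is made up for the example.

```python
import unicodedata

composed = "\u0101"     # LATIN SMALL LETTER A WITH MACRON, one code point
decomposed = "a\u0304"  # 'a' followed by COMBINING MACRON, two code points

def canonical(text):
    """Normalize to NFC so both spellings of a character become identical."""
    return unicodedata.normalize("NFC", text)

print(len(composed.encode("utf-8")))                  # 2 bytes for one character
print(composed == decomposed)                         # False: different code points
print(canonical(composed) == canonical(decomposed))   # True after normalization
```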
Fellows, I have a rather perverse question. Please forgive me :)
There's an official algorithm that describes how bidirectional unicode text should be presented.
http://www.unicode.org/reports/tr9/tr9-15.html
I receive a string (from some 3rd-party source) which contains Latin/Hebrew characters, as well as digits, white-space, punctuation symbols, etc.
The problem is that the string I receive is already in its presentation form, i.e. the sequence of characters I receive should simply be displayed from left to right.
Now, my goal is to find the Unicode string whose rendered representation is exactly the same. That is, I need to pass that string to another entity, which will then render it according to the official algorithm, and the result should look the same.
Assuming the following:
The default text direction (of the rendering entity) is RTL.
I don't want to inject "special unicode characters" that explicitly override the text direction (such as RLO, RLE, etc.)
I suspect there may exist several solutions. If so, I'd like to preserve the RTL look of the string as much as possible. The string usually consists mostly of Hebrew words. I'd like to preserve the correct order of those words, and of the characters inside those words, whereas other character sequences may (and should) be transposed.
One naive way to solve this is to reverse the whole string (this takes care of the Hebrew words), and then reverse each sequence of non-Hebrew characters inside it. This, however, doesn't always produce correct results, because the actual rules of representation are rather complex.
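For reference, the naive approach just described might look like the sketch below (Python, for illustration only). It treats any character whose Unicode bidi class is R or AL as "Hebrew-like" and everything else, including digits, spaces and punctuation, as part of a non-Hebrew run. The function names are made up for the example, and as noted this does not reproduce the full algorithm from the report.

```python
import unicodedata

def is_rtl(ch):
    """True for strong right-to-left characters (Hebrew is R, Arabic is AL)."""
    return unicodedata.bidirectional(ch) in ("R", "AL")

def naive_visual_to_logical(visual):
    """Reverse the whole string, then re-reverse every run of non-RTL characters."""
    out, run = [], []
    for ch in reversed(visual):
        if is_rtl(ch):
            if run:  # a non-RTL run just ended: restore its original order
                out.append("".join(reversed(run)))
                run = []
            out.append(ch)
        else:
            run.append(ch)
    if run:
        out.append("".join(reversed(run)))
    return "".join(out)

# Hebrew letters are written as escapes so the source displays unambiguously.
visual = "\u05d0\u05d1\u05d2 eng 123 \u05d3\u05d4\u05d5"
logical = naive_visual_to_logical(visual)
print([f"U+{ord(c):04X}" if is_rtl(c) else c for c in logical])
```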
The only comprehensive algorithm that I see so far is a brute-force check. The string can be divided into sequences of same-class characters. Those sequences may be joined in any order, and any of them may be reversed. I can check all those combinations to find the correct result.
This technique can also be optimized. For instance, the order of the Hebrew words is known, so we only have to check different combinations of the sequences that join them.
Any better ideas? If you have an idea, not necessarily the whole solution - it's ok. I'll appreciate any idea.
Thanks in advance.
If you want to check the directionality of a character, you have to use the UCD (Unicode Character Database), which is provided by Unicode.org and includes lots of information about characters. One of the attributes in that database gives the bidirectional class of a character.
So you have to download the UCD, then write a class that looks up your character in the XML and returns the answer.
I did this in an open-source C# application, and you can find it here: http://Unicode.Codeplex.com
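As a quick illustration of the same lookup without parsing the XML yourself: Python's standard unicodedata module exposes the bidirectional class from the UCD. The helper name bidi_classes is made up for the example.

```python
import unicodedata

def bidi_classes(text):
    """Return the UCD bidirectional class of each character in the text."""
    return [unicodedata.bidirectional(ch) for ch in text]

# Hebrew alef, Latin 'a', digit, space, comma, Arabic alef
print(bidi_classes("\u05d0a1 ,\u0627"))  # ['R', 'L', 'EN', 'WS', 'CS', 'AL']
```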
Please let me know whether this resolves your issue.
Nasser, thanks for the answer.
Unfortunately it doesn't fully resolve my problem.
So far, for every character I can find out its directionality. Still, I don't see how I can compute the whole string so that its representation matches what I need.
Imagine you want to have the following text written from left to right, where Hebrew/Arabic characters are denoted by capital letters:
ABC eng 123 456 DEF
The correct string would be like this:
FED 456 123 eng CBA
or also:
FED eng 456 123 CBA
Or, if using explicit direction override codes it can be written like this:
FED eng 123 456 CBA
Currently I solve this problem by injecting explicit directionality override codes into the string: I isolate sequences of Hebrew/Arabic words, and for all the joining LTR/weak/neutral characters I explicitly override the direction to LTR.
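A minimal sketch of that workaround, in Python for illustration. It assumes each run of characters that are not strong RTL (bidi class R or AL) is wrapped in LRO (U+202D) and PDF (U+202C); the function names are made up for the example and this is not necessarily how the original code is structured.

```python
import unicodedata

LRO = "\u202d"  # LEFT-TO-RIGHT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

def is_rtl(ch):
    """True for strong right-to-left characters (bidi class R or AL)."""
    return unicodedata.bidirectional(ch) in ("R", "AL")

def visual_to_logical_with_overrides(visual):
    """Reverse the string; keep non-RTL runs in visual order inside LRO...PDF."""
    out, run = [], []

    def flush():
        if run:
            out.append(LRO + "".join(reversed(run)) + PDF)
            run.clear()

    for ch in reversed(visual):
        if is_rtl(ch):
            flush()
            out.append(ch)
        else:
            run.append(ch)
    flush()
    return "".join(out)

logical = visual_to_logical_with_overrides("\u05d0\u05d1 ab 12 \u05d2\u05d3")
print(logical.encode("unicode_escape").decode("ascii"))
```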
However I'd like to do this without injecting explicit override codes.