Characters debugging based on code points

Characters debugging based on code points - regex

I have a character vector with multiple coding "errors" which I pulled from it.dbpedia.org. In fact, each accented characters is rendered incorrectly like "\"Democrazia Ã¨ LibertÃ - La Margherita\"#it" instead of \"Democrazia è Libertà - La Margherita\"#it.
I found a debugging map for this kind of encoding problems here. Still I noticed that the relation between "actual" and "expected" characters is not one-to-one (as I would expect) but one-to-many. Then my character "Ã" might alternatively translate as "Á", "Í", "Ï", "Ð", "Ý", "à". In other words, I can't use a pattern/replacement solution for actual/expected characters.
Can I use a pattern/replacement solution with Unicode code points / expected characters? How do I pass to gsub() the unicode code point instead of the actual characters?
Should I use instead a package as stringi to solve the encoding issue? How?
UPDATE: I just noticed the problem is at the source: the XML output of SPARQL.
NOTE: Related to this unanswered question.

Related

How to find and replace box character in text file?

I have a large text file that I'm going to be working with programmatically but have run into problems with a special character strewn throughout the file. The file is way too large to scan it looking for specific characters. Most of the other unwanted special characters I've been able to get rid of using some regex pattern. But there is a box character, similar to "□". When I tried to copy the character from the actual text file and past it here I get "�", so the example of the box is from Windows character map which includes the code 'U+25A1', which I'm not sure how to interpret or if it's something I could use for a regex search.
Would anyone know how I could search for the box symbol similar to "□" in a UTF-8 encoded file?
EDIT:
Here is an example from the text file:
"� Prune palms when flower spathes show, or delay pruning until after the palm has finished flowering, to prevent infestation of palm flower caterpillars. Leave the top five rows."
The only problem is that, as mentioned in the original post, the square gets converted into a diamond question mark.

It's unclear where and how you are searching, although you could use the hex equivalent:
\x{25A1}
Example:
https://regex101.com/r/b84oBs/1

The black diamond with a question mark is not a character, per se. It is what a browser spits out at you when you give it unrecognizable bytes.
Find out where that data is coming from.
Determine its encoding. (Usually UTF-8, but might be something else.)
Be sure the browser is configured to display that encoding. This is likely to suffice <meta charset=UTF-8> in the header of the page.

I found a workaround using Notepad++ and this website. It's still not clear what encoding system the square is originally from, but when I post it into the query field in the website above or into the Notepad++ Conversion Table (Plugins > Converter > Conversion Table) it gives the hex-character code for the "Replacement Character" which is the diamond with the question mark.
Using this code in a regex expression, \x{FFFD}, within Notepad++ search gave me all the squares, although recognizing them as the Replacement Character.

Translate accented to unaccented characters in Sublime Text snippet using regex

I'm writing a ST3 snippet that inserts a \subsection{} with a label. The label is created by converting the header text to conform with the LaTeX standards for labels using a (rather lengthy) regular expression:
${1/(?:([ \t_]+)?|\b)(?:([ÅÄÆÁÀÃ])?|\b)(?:([åäæâàáã])?|\b)(?:([ÉÈÊË])?|\b)(?:([éèëê])?|\b)(?:([ÌÌÎÏ])?|\b)(?:([íìïî])?|\b)(?:([Ñ])?|\b)(?:([ñ])?|\b)(?:([ÖØÓÒÔÖÕ])?|\b)(?:([öøóòôõ])?|\b)(?:([ÜÛÚÙ])?|\b)(?:([üûúù])?|\b)/(?1:-)(?2:A)(?3:a)(?4:E)(?5:e)(?6:I)(?7:i)(?8:N)(?9:n)(?10O)(?11:o)(?12:U)(?13:u)/g}
Actually, I would like for it to be even longer. But if I add the extra groups that I would like, then ST3 crashes when I execute the snippet.
${1/(?:([ \t_]+)?|\b)(?:([ÅÄÆÁÀÃ])?|\b)(?:([åäæâàáã])?|\b)(?:([Ç])?|\b)(?:([ç])?|\b)(?:([ÉÈÊË])?|\b)(?:([éèëê])?|\b)(?:([ÌÌÎÏ])?|\b)(?:([íìïî])?|\b)(?:([Ñ])?|\b)(?:([ñ])?|\b)(?:([ÖØÓÒÔÖÕ])?|\b)(?:([öøóòôõ])?|\b)(?:([ÜÛÚÙ])?|\b)(?:([üûúù])?|\b)(?:([Ý])?|\b)(?:([ÿý])?|\b)/(?1:-)(?2:A)(?3:a)(?4:C)(?5:c)(?6:E)(?7:e)(?8:I)(?9:i)(?10:O)(?11:o)(?12:N)(?13:n)(?14:U)(?15:u)(?16:Y)(?17:y)/g}
Is there any more efficient way of doing this? Preferably one that won't cause ST3 to crash ;)
Edit:
Here are some example strings:
Flygande bæckasiner søka hwila på mjuka tuvor
Åke Staël hade en överflödig idé
And the results (with the current, working regex):
Flygande-backasiner-soka-hwila-pa-mjuka-tuvor
Ake-Stael-hade-en-overflodig-ide
But I would like to also replace the characters (ÇçÝÿý) with their unaccented counterparts (CcYyy) so that e.g.
Comment ça va
becomes
Comment-ca-va

I don't know this syntax, but I suspect that the problem comes from the too many optional groups combined with a lot of alternatives that cause a too complex processing.
So you can try to design your pattern like this, and you can add other groups of letters in the same way (take a look at the unicode table to find character ranges):
${1/([ \t_]+)|([À-Å])|([à-å])|([È-Ë])|([è-ë])|([Ì-Ï])|([ì-ï])|([Ò-ÖØ])|([ò-öø])|([Ù-Ü])|([ù-ü])|(Æ)|(æ)|(Œ)|(œ)|(Ñ)|(ñ)/(?1:-)(?2:A)(?3:a)(?4:E)(?5:e)(?6:I)(?7:i)(?8:O)(?9:o)(?10:U)(?11:u)(?12:AE)(?13:ae)(?14:OE)(?15:oe)(?16:N)(?17:n)/g}
if the lookahead feature is available you can improve this pattern to prevent non-accented characters to be tested with each alternatives:
${1/(?=[ \t_À-ÆÈ-ÏÑ-ÖØ-Üà-æè-ïñ-öø-üŒœ])(?:([ \t_]+)|([À-Å])|([à-å])|([È-Ë])|([è-ë])|([Ì-Ï])|([ì-ï])|([Ò-ÖØ])|([ò-öø])|([Ù-Ü])|([ù-ü])|(Æ)|(æ)|(Œ)|(œ)|(Ñ)|(ñ))/(?1:-)(?2:A)(?3:a)(?4:E)(?5:e)(?6:I)(?7:i)(?8:O)(?9:o)(?10:U)(?11:u)(?12:AE)(?13:ae)(?14:OE)(?15:oe)(?16:N)(?17:n)/g}
Note: Æ (Aelig) must be transliterated as AE (the same for Œ => OE)

Using xsl:character-map on text nodes only

I am trying to create a generic stylesheet that can convert all Latin characters in Unicode to uppercase ASCII characters. Using <xsl:character-map> works well except for one thing: namespaces. The character map converts all of my namespaces to upper case, which I do not want.
Is there a way to utilize a character map to do what I want to all the other nodes while leaving the namespaces untouched? I see the disable-output-escaping attribute might be an option, but I haven't been able to make it work.

Looks like this is an Oracle-specific issue. I'll probably post this on the Oracle Forums then.
Thanks for all the feedback!

Using preg_replace/ preg_match with UTF-8 characters - specifically Māori macrons

I'm writing some autosuggest functionality which suggests page names that relate to the terms entered in the search box on our website.
For example typing in "rubbish" would suggest "Rubbish & Recycling", "Rubbish Collection Centres" etc.
I am running into a problem that some of our page names include macrons - specifically the macron used to correctly spell "Māori" (the indigenous people of New Zealand).
Users are going to type "maori" into the search box and I want to be able to return pages such as "Māori History".
The autosuggestion is sourced from a cached array built from all the pages and keywords. To try and locate Māori I've been trying various regex expressions like:
preg_match('/\m(.{1})ori/i',$page_title)
Which also returns page titles containing "Moorings" but not "Māori". How does preg_match/ preg_replace see characters like "ā" and how should I construct the regex to pick them up?
Cheers
Tama

Use the /u modifier for utf-8 mode in regexes,
You're better of on a whole with doing an iconv('utf-8','ascii//TRANSLIT',$string) on both name & search and comparing those.

One thing you need to remember is that UTF-8 gives you multi-byte characters for anything outside of ASCII. I don't know if the string $page_title is being treated as a Unicode object or a dumb byte string. If it's the byte string option, you're going to have to do double dots there to catch it instead, or {1,4}. And even then you're going to have to verify the up to four bytes you grab between the M and the o form a singular valid UTF-8 character. This is all moot if PHP does unicode right, I haven't used it in years so I can't vouch for it.
The other issue to consider is that ā can be constructed in two ways; one as a single character (U+0101) and one as TWO unicode characters ('a' plus a combining diacritic in the U+0300 range). You're likely just only going to ever get the former, but be aware that the latter is also possible.
The only language I know of that does this stuff reliably well is Perl 6, which has all kinds on insane modifiers for internationalized text in regexps.

How to compute a unicode string which bidirectional representation is specified?

fellows. I have a rather pervert question. Please forgive me :)
There's an official algorithm that describes how bidirectional unicode text should be presented.
http://www.unicode.org/reports/tr9/tr9-15.html
I receive a string (from some 3rd-party source), which contains latin/hebrew characters, as well as digits, white-spaces, punctuation symbols and etc.
The problem is that the string that I receive is already in the representation form. I.e. - the sequence of characters that I receive should just be presented from left to right.
Now, my goal is to find the unicode string which representation is exactly the same. Means - I need to pass that string to another entity; it would then render this string according to the official algorithm, and the result should be the same.
Assuming the following:
The default text direction (of the rendering entity) is RTL.
I don't want to inject "special unicode characters" that explicitly override the text direction (such as RLO, RLE, etc.)
I suspect there may exist several solutions. If so - I'd like to preserve the RTL-looking of the string as much as possible. The string usually consists of hebrew words mostly. I'd like to preserve the correct order of those words, and characters inside those words. Whereas other character sequences may (and should) be transposed.
One naive way to solve this is just to swap the whole string (this takes care of the hebrew words), and then swap inside it sequences of non-hebrew characters. This however doesn't always produce correct results, because actual rules of representation are rather complex.
The only comprehensive algorithm that I see so far is brute-force check. The string can be divided into sequences of same-class characters. Those sequences may be joined in random order, plus any of them may be reversed. I can check all those combinations to obtain the correct result.
Plus this technique may be optimized. For instance the order of hebrew words is known, so we only have to check different combinations of their "joining" sequences.
Any better ideas? If you have an idea, not necessarily the whole solution - it's ok. I'll appreciate any idea.
Thanks in advance.

If you want to check if a character is Bidirectional you have to use UCD (Unicode Character Database) which provided by Unicode.org and includes lots of information about characters . in one of that DB attributes you can find the Bidirectionality of a character
So you have to Download USD , then write a class to look for your character in the XML and return answer
I did this in an opensource C# application and you can ind it here http://Unicode.Codeplex.com
Please let me know has your issue resolved by this or not.

Nasser, thanks for the answer.
Unfortunately it doesn't fully resolve my problem.
So far for every character I can know its directionality. Still I don't see how can I compute the whole string so that its representation would match what I need.
Imagine you want to have the following text written from left to right, whereas hebrew/arabic characters are denoted by BIG:
ABC eng 123 456 DEF
The correct string would be like this:
FED 456 123 eng CBA
or also:
FED eng 456 123 CBA
Or, if using explicit direction override codes it can be written like this:
FED eng 123 456 CBA
Currently I solved this problem by injecting explicit directionality override codes into the string. So that I isolate sequences of hebrew/arabic words, and for all the joining LTR/Weak/Neutral characters I explicitly override the direction to LTR.
However I'd like to do this without injecting explicit override codes.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js