NSRegularExpression match is not working - regex

I'm trying to replace some escaped unicode in an NSString. I haven't had any luck with the CFString functions, so I thought I would try regular expressions.
Here is the regex
NSRegularExpression *regexUnicode2 = [NSRegularExpression regularExpressionWithPattern:@"(\\u([0-9A-Fa-f]){4}){2}" options:0 error:&error];
Then I try to get matches using this
NSArray *twoEscapeArray = [regexUnicode2 matchesInString:selfCopy options:0 range:NSMakeRange(0, self.length)];
selfCopy is a mutable copy of the input string. Here is a piece of that string:
muestran al p\u00c3\u00bablico las encuadernaciones de las colecciones
reales adem\u00c3\u00a1s de otros objetos hist\u00c3\u00b3ricos en
relaci\u00c3\u00b3n con \u00c3\u00a9stas.La muestra,
considerada a nivel mundial como uno de los conjuntos ligatorios
hist\u00c3\u00b3ricos m\u00c3\u00a1s importantes, se completa con
obras de arte como armas, alfombras y relojes. Estos son objetos que
ayudan a entender la encuadernaci\u00c3\u00b3n como elemento
fundamental de la cultura de corte.Los fondos de la Real
Biblioteca, del Real Monasterio de San Lorenzo de El Escorial, del
Monasterio de Santa Mar\u00c3\u00ada la Real de las Huelgas de Burgos,
del Monasterio de las
Without proper conversion, these escaped unicode pairs are being treated as individual characters (each pair produces two characters) when I put them into a UIWebView.
This is how the raw JSON data is coded, and I haven't had any luck getting it to convert to Latin characters properly.
Anyway, the problem is that the array twoEscapeArray is nil after the match attempt. I'm not sure why.

You mean \u00c3\u00ba is getting converted to Ãº? That looks like the correct behavior to me, since U+00C3 is Ã and U+00BA is º. The real question is how those Unicode escapes got in there. It looks like the text was decoded incorrectly at some point (possibly when the NSString was created?), and what should have been the two-byte UTF-8 encoding (0xC3 0xBA) of the letter ú (U+00FA, Latin Small Letter U With Acute) was decoded as two separate characters.
Try going back to where you created the NSString, this time specifying UTF-8 as the encoding.
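To illustrate the diagnosis, here is a minimal Python sketch (not the asker's Objective-C code) of what happens to the sample text. The helper names decode_escapes and repair_mojibake are made up for this example, and the repair step assumes every decoded character is at most U+00FF, which holds for the sample but may not hold for arbitrary input.

```python
import re

def decode_escapes(text: str) -> str:
    """Replace each literal backslash-u escape (4 hex digits) with its character."""
    return re.sub(
        r"\\u([0-9A-Fa-f]{4})",            # a literal backslash, 'u', four hex digits
        lambda m: chr(int(m.group(1), 16)),
        text,
    )

def repair_mojibake(text: str) -> str:
    """Reinterpret the code points as bytes and decode those bytes as UTF-8.

    Assumption: every character is <= U+00FF, otherwise encode() raises.
    """
    return text.encode("latin-1").decode("utf-8")

raw = r"muestran al p\u00c3\u00bablico las encuadernaciones"
decoded = decode_escapes(raw)        # two characters per pair: 'Ã' and 'º'
repaired = repair_mojibake(decoded)  # the pair collapses into a single 'ú'
print(decoded)
print(repaired)
```

Note that the pattern needs a literal backslash in front of the u. In many regex engines a bare \u followed by four hex digits is itself an escape for a single character, which may be one reason the original pattern finds nothing.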

Is this a bug in ruby Regexp? How to guard against "infinite loop" from regex match without using Timeout?

I have this regex:
regex = /(Si.ges[a-zA-Z\W]*avec\W*fonction\W*m.moires)/i
And when I use it on some, but not all, texts e.g. this one:
text = "xation de 2 sièges-enfants sur la banquette AR),Pack \"Assistance\",Keyless Access avec alarme : Système de verrouillage/déverrouillage et de démarrage sans clé,Park Assist: Système d'assistance au stationnement en créneauet et en bataille,Rear Assist: Caméra de recul avec visualisation de la zone situ"
like so: text.match(regex), then Ruby just runs in what seems like an infinite loop. Why? And is there any way to guard against this, e.g. by having Ruby throw an exception instead, without using Timeout? Timeout has a known issue when used with Sidekiq (https://github.com/mperham/sidekiq/wiki/Problems-and-Troubleshooting#add-timeouts-to-everything).
ruby version: 2.7.2
Built-in character classes tend to be table-driven. Because of that, negated built-in classes such as \W and \S are difficult for an engine to merge into a positive character class.
In this case it looks like an engine bug because, as you said, it doesn't time out on some target strings.
In fact, [a-xzA-XZ\W] works on the sample string. It times out when Y is included anywhere in the class, but only for that particular string.
Let's see if we can determine whether this is a bug or not.
First, some tests:
Test - Fail [a-zA-Z\W]
https://rextester.com/FHUQG84843
# Test - Fail [a-zA-Z\W]
puts "Hello World!";
regex = /(Si.ges[a-zA-Z\W]*avec\W*fonction\W*m.moires)/ui;
text = "xation de 2 sièges-enfants sur la banquette AR),Pack \"Assistance\",Keyless Access avec alarme : Système de verrouillage/déverrouillage et de démarrage sans clé,Park Assist: Système d'assistance au stationnement en créneauet et en bataille,Rear Assist: Caméra de recul avec visualisation de la zone situ";
res = text.match(regex);
puts "Done";
Test - Pass [a-xzA-XZ\W]
https://rextester.com/RPV28606
Test - Pass [a-zA-Z\P{Word}]
https://rextester.com/DAMW9069
Conclusion: report this as a bug.
IMO this is a bug in the built-in class \W, which is engine-defined, whereas \P{Word} is defined through a Unicode property function rather than a range.
And we see that [a-zA-Z\P{Word}] works just fine.
Use \P{Word} inside classes as a temporary workaround.
Historically, when modern regex engines were first designed, each item in a negated class [^...] was combined with AND NOT, while each item in a positive class is ORed. Mixing the two in one class leads to errors in scope.
Perl still had character-class bugs until fairly recently.
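One way to sanity-check that the pattern itself is not inherently prone to runaway backtracking is to run it through a different engine. The sketch below does this with Python's re module, as an assumption-laden cross-check rather than a statement about Ruby's engine: the sample text contains no "fonction", so no match is possible, and the only candidate start position is the single "sièges".

```python
import re

# Same pattern as in the question, compiled case-insensitively.
pattern = re.compile(
    r"(Si.ges[a-zA-Z\W]*avec\W*fonction\W*m.moires)",
    re.IGNORECASE,
)

text = (
    "xation de 2 sièges-enfants sur la banquette AR),Pack \"Assistance\","
    "Keyless Access avec alarme : Système de verrouillage/déverrouillage "
    "et de démarrage sans clé,Park Assist: Système d'assistance au "
    "stationnement en créneauet et en bataille,Rear Assist: Caméra de "
    "recul avec visualisation de la zone situ"
)

# The text never contains "fonction", so the search cannot succeed.
result = pattern.search(text)
print(result)
```

If another engine rejects the text without trouble, that is consistent with the hang being specific to how one engine handles \W inside a class, though it does not prove it.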

Regex, Remove text NOT between tags

I need to remove all text that is not between the tags <p> and </p>. There can be many <p> tags in each cell. The content before <p> and after </p> is different in each row.
Example
<h1>Curly Krans Daggdroppar 30cm LED</h1><h2>Beskrivning</h2><div id="more_info_sheets" class="sheets align_justify"><div id="idTab1" class="rte"><div id="more_info_sheets" class="sheets align_justify"><div id="idTab1" class="rte"><p>En krans med en snygg och intressant design. </p><p>Kransen har 30st ej utbytbara små LED lampor.</p><p>Finns i tre olika färger, välj mellan, koppar, mässing och krom.</p></div></div></div></div>
Should be
<p>En krans med en snygg och intressant design. </p><p>Kransen har 30st ej utbytbara små LED lampor.</p><p>Finns i tre olika färger, välj mellan, koppar, mässing och krom.</p>
Anyone know how to do this?
You can use a match expression to capture only the desired tags instead of replacing the rest of the text. Here are both options as regular expressions:
Match everything from the first <p> to the last </p> (greedy)
<p>.*<\/p>
Match each p group separately (non-greedy)
<p>.*?<\/p>
Match the text outside the p groups
(^.*?(?=<p>))|((?<=<\/p>)<[^p].*)
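The "match and keep" approach can be sketched in Python as follows. The HTML is an abbreviated version of the example from the question, and the sketch assumes the <p> tags carry no attributes and are not nested.

```python
import re

# Abbreviated sample from the question (assumption: <p> has no attributes).
html = (
    '<h1>Curly Krans Daggdroppar 30cm LED</h1><h2>Beskrivning</h2>'
    '<div id="idTab1" class="rte">'
    '<p>En krans med en snygg och intressant design. </p>'
    '<p>Kransen har 30st ej utbytbara små LED lampor.</p>'
    '</div>'
)

# Non-greedy match: each <p>...</p> pair is captured on its own.
# DOTALL lets a paragraph span several lines.
paragraphs = re.findall(r"<p>.*?</p>", html, flags=re.DOTALL)

# Keep only the paragraphs, dropping everything around them.
cleaned = "".join(paragraphs)
print(cleaned)
```

For HTML that is less regular than this, an HTML parser is usually a safer choice than a regex.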

Regex Challenge REAL tag-striper

With one (or two) regexes, get the result: TEXTO1|TEXTO2|TEXTO3 e TEXTO4|TEXTO5|TEXTO6
from the string below, between INICIO and FIM:
INICIO<aaa>TEXTO1</aaa><bbb></bbb><?xml:proriety>TEXTO2</xml:proriety="atribute"><aaa>TEXTO3 e TEXTO4</aaa>entre-tags-não-importam<bbb>TEXTO5</bbb><bbb>TEXTO6</bbb>FIM
Eventually I'll learn how to use the Stack Overflow formatting style. For now, see:
https://docs.google.com/document/d/1lxtDvHUmnMlZJq0vars4U72QyrYjBtE68YDWaHioiiU/edit?usp=sharing
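One possible approach, sketched in Python and not taken from the original post: capture only the text that sits directly between an opening tag and the next closing tag. The rule "an opening tag is any tag whose first character is not /" is an assumption made to fit this sample.

```python
import re

source = (
    'INICIO<aaa>TEXTO1</aaa><bbb></bbb><?xml:proriety>TEXTO2'
    '</xml:proriety="atribute"><aaa>TEXTO3 e TEXTO4</aaa>'
    'entre-tags-não-importam<bbb>TEXTO5</bbb><bbb>TEXTO6</bbb>FIM'
)

# <[^/>][^>]*>  an opening tag (first character is not '/')
# ([^<]+)       non-empty text with no tag inside it
# </            the start of a closing tag
texts = re.findall(r"<[^/>][^>]*>([^<]+)</", source)

result = "|".join(texts)
print(result)
```

Empty elements such as <bbb></bbb> are skipped because the capture group requires at least one character, and text that follows a closing tag (the "entre-tags" part) is never preceded by an opening tag.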

Regex, can't get it

Seems easy but it doesn't work. I have something like this :
5224Reportage chez Ben ferme Ayrshire 2000 inc.2009-08-26T00:00:00-04:00En 2001, plutôt que de prendre le chemin de l’expansion, Ben ferme Ayrshire 2000 d’Hébertville au Lac-Saint-Jean a décidé d’ajouter le volet fromagerie à l’entreprise en misant sur la qualité.Revue/PLQ-2009-09/reportage.pdf5144Un deuxième Revue/PLQ-2014-07/production.pdf
From this I need an array containing :
Revue/PLQ-2009-09/reportage.pdf
Revue/PLQ-2014-07/production.pdf
I used :
$pdfResult = array();
preg_match_all('/^Revue.*pdf$/',$string, $pdfResult);
It returns nothing...
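.* is greedy by default. Make it non-greedy by adding the ? quantifier right after the *. You also don't need the anchors: with ^ the pattern can only match if the whole input starts with Revue, which it doesn't, so nothing is returned. Escaping the dot in \.pdf makes it match a literal dot.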
preg_match_all('~Revue.*?\.pdf~',$string, $pdfResult);
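The difference between the anchored, greedy and non-greedy variants can be seen in a small Python sketch. The input is an abbreviated version of the string from the question (the middle of the French sentence is cut out), and re.findall stands in for preg_match_all.

```python
import re

# Abbreviated sample: the middle of the description has been shortened.
data = (
    "5224Reportage chez Ben ferme Ayrshire 2000 inc.2009-08-26T00:00:00-04:00"
    "En 2001, en misant sur la qualité.Revue/PLQ-2009-09/reportage.pdf"
    "5144Un deuxième Revue/PLQ-2014-07/production.pdf"
)

# Anchored: the input does not start with "Revue", so nothing matches.
anchored = re.findall(r"^Revue.*pdf$", data)

# Greedy: one long match from the first "Revue" to the last ".pdf".
greedy = re.findall(r"Revue.*\.pdf", data)

# Non-greedy: each match stops at the nearest ".pdf".
lazy = re.findall(r"Revue.*?\.pdf", data)

print(anchored)
print(len(greedy))
print(lazy)
```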

Extract address from description with regex

I'm trying to extract an address (written in French) out of a listing using regex.
Here is the example:
"Don't wait, this home won't be on the market for long!
Pictures can be forwarded upon request.
123 de la street - city
345-555-1234 "
Imagine that whole thing is item.description. Here is a working set so far:
In "item.description", replace "^\d{1,4} des|de la|du [^,\s]+$" with "whatever"
and the address (123 de la street) is correctly overwritten with "whatever". But if I try to make it the only thing kept from the description, with something like this, it doesn't work:
In "item.description" replace "(.)(^\d{1,4} des|de la|du [^,\s]+$)(.)" with "$2"
What would be the best way to replace the whole description with just the address?
Thanks!
Try adding * to the first and last tokens, and watch out for the ^ and $ anchors (they match the start and end of the text).
"^(.*)(\d{1,4} des|de la|du [^,\s]+)(.*)$"