Q: Regex Challenge REAL tag-striper - regex

With one (or two) regexes, get the result: TEXTO1|TEXTO2|TEXTO3 e TEXTO4|TEXTO5|TEXTO6
from the string below, between INICIO and FIM:
INICIO<aaa>TEXTO1</aaa><bbb></bbb><?xml:proriety>TEXTO2</xml:proriety="atribute"><aaa>TEXTO3 e TEXTO4</aaa>entre-tags-não-importam<bbb>TEXTO5</bbb><bbb>TEXTO6</bbb>FIM

Eventually I'll learn how to use Stack Overflow's formatting. But for now:
https://docs.google.com/document/d/1lxtDvHUmnMlZJq0vars4U72QyrYjBtE68YDWaHioiiU/edit?usp=sharing

Related

Extract everything inside quotation marks but keep quoted content quoted

I have the following case
"_,\"'() is a marker of \"'_( ,)\"."
and I want to extract this string with Regex such that:
_,\"'() marker \"'_( ,)\"
matches.
Another example for better readability (however the previous example is more important for the use case)
"Test is a marker for 'testing'"
which should result in
Test marker testing
G is a abbreviation for "GDP ('Gross Domestic Product')"
G abbreviation GDP ('Gross Domestic Product')
There are only two options: either marker or abbreviation.
My current regex is the following:
/(.*)+ is the (father|mother) of (?:"([^,]*)")./
But it doesn't work with the first example.
Any help is much appreciated.
You could use
(\w+) is a (marker|abbreviation) for ("[^"]*?"|'[^']*?')
R Example
library(stringr)
convert <- function(s) {
  # res[1] is the full match; res[2..4] are the three capture groups
  res <- str_match(s, "(\\w+) is a (marker|abbreviation) for (\"[^\"]*?\"|'[^']*?')")
  # Strip the surrounding quotes from the third group and join the parts
  paste(res[2], res[3], substr(res[4], 2, nchar(res[4]) - 1))
}
print(convert("Test is a marker for 'testing'")) # Test marker testing
print(convert("G is a abbreviation for \"GDP ('Gross Domestic Product')\"")) # G abbreviation GDP ('Gross Domestic Product')
P.S. Since most of your questions are about R, I've given the example in R. Hope it helps.
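For readers not using R, here is a minimal Python sketch of the same pattern. The function name convert mirrors the R example. Returning None when nothing matches is an assumption added for illustration, not part of the original answer.

```python
import re

# Same pattern as in the answer: a word, the keyword, then a quoted string
# in either double or single quotes.
PATTERN = re.compile(r"""(\w+) is a (marker|abbreviation) for ("[^"]*?"|'[^']*?')""")

def convert(s):
    """Return 'subject keyword content', or None if the pattern does not match."""
    m = PATTERN.search(s)
    if m is None:
        return None  # assumption: signal "no match" with None
    subject, kind, quoted = m.groups()
    # quoted still carries its outer quotes; drop the first and last character
    return f"{subject} {kind} {quoted[1:-1]}"

print(convert("Test is a marker for 'testing'"))
print(convert('G is a abbreviation for "GDP (\'Gross Domestic Product\')"'))
```

Because the subject is matched with \w+ and the pattern expects the literal word "for", an input such as the first example in the question (punctuation as the subject, "marker of") would not match this pattern as written.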

Is this a bug in ruby Regexp? How to guard against "infinite loop" from regex match without using Timeout?

I have this regex:
regex = /(Si.ges[a-zA-Z\W]*avec\W*fonction\W*m.moires)/i
And when I use it on some, but not all, texts e.g. this one:
text = "xation de 2 sièges-enfants sur la banquette AR),Pack \"Assistance\",Keyless Access avec alarme : Système de verrouillage/déverrouillage et de démarrage sans clé,Park Assist: Système d'assistance au stationnement en créneauet et en bataille,Rear Assist: Caméra de recul avec visualisation de la zone situ"
like so: text.match(regex), Ruby just runs in what seems like an infinite loop. But why? And is there any way to guard against this, e.g. by having Ruby throw an exception instead, without using Timeout, since that is a known issue when used with Sidekiq (https://github.com/mperham/sidekiq/wiki/Problems-and-Troubleshooting#add-timeouts-to-everything)?
ruby version: 2.7.2
Built-in character classes are mostly table-driven.
Because of that, negated built-in classes such as \W and \S
are difficult for engines to merge into a positive character class.
There is evidently a bug here because, as you said, the match does not time out on
every target string.
In fact, [a-xzA-XZ\W] works on the sample string. It times out when Y is included anywhere
in the class, but only for that particular string.
Let's see if we can determine whether this is a bug or not.
First, some tests:
Test - Fail [a-zA-Z\W]
https://rextester.com/FHUQG84843
# Test - Fail [a-zA-Z\W]
regex = /(Si.ges[a-zA-Z\W]*avec\W*fonction\W*m.moires)/ui
text = "xation de 2 sièges-enfants sur la banquette AR),Pack \"Assistance\",Keyless Access avec alarme : Système de verrouillage/déverrouillage et de démarrage sans clé,Park Assist: Système d'assistance au stationnement en créneauet et en bataille,Rear Assist: Caméra de recul avec visualisation de la zone situ"
# In this failing test the match call times out
res = text.match(regex)
puts "Done"
Test - Pass [a-xzA-XZ\W]
https://rextester.com/RPV28606
Test - Pass [a-zA-Z\P{Word}]
https://rextester.com/DAMW9069
Conclusion: report this as a bug.
IMO this is a bug in the built-in class \W, which is engine-defined,
whereas \P{Word} is defined through a Unicode property rather than a range.
And we can see that [a-zA-Z\P{Word}] works just fine.
Use \P{Word} inside classes as a temporary workaround.
When modern regex engines were first designed, each item of a negated class [^...]
was combined with AND NOT, while each item of a positive class is ORed;
combining the two leads to scoping errors.
Perl still had character-class errors until a short time ago.

Regex, can't get it

Seems easy but it doesn't work. I have something like this :
5224Reportage chez Ben ferme Ayrshire 2000 inc.2009-08-26T00:00:00-04:00En 2001, plutôt que de prendre le chemin de l’expansion, Ben ferme Ayrshire 2000 d’Hébertville au Lac-Saint-Jean a décidé d’ajouter le volet fromagerie à l’entreprise en misant sur la qualité.Revue/PLQ-2009-09/reportage.pdf5144Un deuxième Revue/PLQ-2014-07/production.pdf
From this I need an array containing :
Revue/PLQ-2009-09/reportage.pdf
Revue/PLQ-2014-07/production.pdf
I used :
$pdfResult = array();
preg_match_all('/^Revue.*pdf$/',$string, $pdfResult);
It returns nothing...
.* is greedy by default. Make it non-greedy by adding the ? quantifier right after the *. You also don't need the anchors, since the strings you want aren't at the start or end of the input.
preg_match_all('~Revue.*?\.pdf~',$string, $pdfResult);
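To make the difference visible, here is a small Python sketch (Python is used only for illustration; the question itself uses PHP's preg_match_all). The sample text is a shortened version of the string in the question, and the variable names are hypothetical.

```python
import re

# Shortened sample: the two paths are embedded in the middle of one long line.
text = ("5224Reportage chez Ben ferme Ayrshire 2000 inc."
        "Revue/PLQ-2009-09/reportage.pdf5144Un deuxième "
        "Revue/PLQ-2014-07/production.pdf")

# Anchored: fails because the text does not start with "Revue".
anchored = re.findall(r"^Revue.*pdf$", text)

# Greedy: one long match from the first "Revue" to the last ".pdf".
greedy = re.findall(r"Revue.*\.pdf", text)

# Lazy: each match stops at the nearest ".pdf".
lazy = re.findall(r"Revue.*?\.pdf", text)

print(anchored)
print(greedy)
print(lazy)
```

The lazy version returns the two separate paths, which is what the non-greedy pattern in the answer is meant to achieve.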

NSRegularExpression match is not working

I'm trying to replace some escaped unicode in an NSString. I haven't had any luck with the CFString functions, so I thought I would try regular expressions.
Here is the regex
NSRegularExpression *regexUnicode2 = [NSRegularExpression regularExpressionWithPattern:@"(\\u([0-9A-Fa-f]){4}){2}" options:0 error:&error];
Then I try to get matches using this
NSArray *twoEscapeArray = [regexUnicode2 matchesInString:selfCopy options:0 range:NSMakeRange(0, self.length)];
selfCopy is a mutable copy of the input string. Here is a piece of that string:
muestran al p\u00c3\u00bablico las encuadernaciones de las colecciones
reales adem\u00c3\u00a1s de otros objetos hist\u00c3\u00b3ricos en
relaci\u00c3\u00b3n con \u00c3\u00a9stas.La muestra,
considerada a nivel mundial como uno de los conjuntos ligatorios
hist\u00c3\u00b3ricos m\u00c3\u00a1s importantes, se completa con
obras de arte como armas, alfombras y relojes. Estos son objetos que
ayudan a entender la encuadernaci\u00c3\u00b3n como elemento
fundamental de la cultura de corte.Los fondos de la Real
Biblioteca, del Real Monasterio de San Lorenzo de El Escorial, del
Monasterio de Santa Mar\u00c3\u00ada la Real de las Huelgas de Burgos,
del Monasterio de las
Without proper conversion, these escaped unicode pairs are being treated as individual characters (each pair produces two characters) when I put them into a UIWebView.
This is how the raw JSON data is coded, and I haven't had any luck getting it to convert to Latin characters properly.
Anyway, the problem is that the array twoEscapeArray is nil after the match attempt. I'm not sure why.
You mean \u00c3\u00ba is getting converted to Ãº? That looks like the correct behavior to me. The real question is how those Unicode escapes got in there. It looks like the text was decoded incorrectly at some point (possibly when the NSString was created?), and what should have been the two-byte UTF-8 encoding of the letter ú (U+00FA, Latin Small Letter U With Acute) was decoded as two characters.
Try going back to where you created the NSString, this time specifying UTF-8 as the encoding.
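The mechanism can be illustrated with a short Python sketch (Python rather than Objective-C, purely to show the byte-level effect). It assumes the wrong decoding was Latin-1, which is consistent with the escapes in the sample but is not confirmed by the question.

```python
import json

# JSON string literal containing one word from the question's sample text
raw = r'"p\u00c3\u00bablico"'

# Decoding the escapes literally yields two characters, Ã (U+00C3) and º (U+00BA)
decoded = json.loads(raw)

# Assumption: the original UTF-8 bytes were read as Latin-1 somewhere upstream.
# Re-encoding as Latin-1 recovers the bytes C3 BA, which decode as UTF-8 to ú.
repaired = decoded.encode("latin-1").decode("utf-8")

# The same damage can be reproduced from the correct word
reproduced = "público".encode("utf-8").decode("latin-1")

print(decoded)
print(repaired)
```

This round trip only repairs text that was damaged in exactly this way. Fixing the encoding at the point where the data is first read, as suggested above, avoids the problem altogether.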

Extract address from description with regex

I'm trying to extract an address (written in French) from a listing using regex.
Here is the example:
"Don't wait, this home won't be on the market for long!
Pictures can be forwarded upon request.
123 de la street - city
345-555-1234 "
Imagine that whole thing is item.description. Here is a working set so far:
In "item.description", replace "^\d{1,4} des|de la|du [^,\s]+$" with "whatever"
and the address (123 de la street) will be correctly overwritten with whatever. BUT if I try to make it the only thing kept from the description, with something like this (which doesn't work):
In "item.description" replace "(.)(^\d{1,4} des|de la|du [^,\s]+$)(.)" with "$2"
What would be the best way to replace the whole description with just the address?
Thanks!
Try adding * to the first and last tokens, and watch out for the ^ and $ anchors: they match the start and end of the text.
"^(.*)(\d{1,4} des|de la|du [^,\s]+)(.*)$"
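As a rough Python sketch of the "keep only the address" idea, under three assumptions that go beyond the answer: the alternatives des / de la / du are wrapped in a group so the number applies to all of them, the leading .* is made lazy so it does not consume the first digits of the street number, and DOTALL is enabled so that . can cross line breaks. The names ADDRESS_ONLY and description are hypothetical.

```python
import re

description = (
    "Don't wait, this home won't be on the market for long!\n"
    "Pictures can be forwarded upon request.\n"
    "123 de la street - city\n"
    "345-555-1234 "
)

# Group 1 is the address: 1-4 digits, then "des", "de la" or "du",
# then one word containing no comma or whitespace.
ADDRESS_ONLY = re.compile(
    r"^.*?(\d{1,4} (?:des|de la|du) [^,\s]+).*$",
    re.DOTALL,  # let "." match newlines so the pattern spans the whole text
)

# Replace the whole description with just the captured address
address = ADDRESS_ONLY.sub(r"\1", description)
print(address)  # 123 de la street
```

If the description contains no address, sub leaves the text unchanged, so a caller may want to test for a match first.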