How can I solve this regex using two asserts? - regex

I have these 3 consecutive words : Nocivic Voie and Quartier
I have something like this :
#Nocivic;Voie;Quartier#
Question :
I need to write a regex that extracts the 3 words Nocivic, Voie and Quartier using a positive lookahead. The semicolons need to be included in the match, but not the #.
I realized that this works: \bNocivic(?=;Voie);\bVoie;Quartier
But why does this not work?
\bNocivic(?=;Voie);\bVoie(?<=Voie;)\bQuartier
I am not very experienced with regex, so if someone could tell me why, or give me the correct pattern if I really wanted to add a lookbehind, it would be greatly appreciated. Thanks.

The first one is equivalent to
\bNocivic;Voie;Quartier
(?=;Voie) just tests whether ;Voie follows Nocivic, so it adds nothing here: the same text is matched right afterwards anyway.
Extract from
https://www.regextutorial.org/positive-and-negative-lookahead-assertions.php
They only assert if in a given test string the match with certain conditions is possible or not Yes or No.
See the difference below
Nocivic;Voie matches and returns Nocivic;Voie
Nocivic(?=;Voie) matches and returns Nocivic
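A minimal sketch of that difference, assuming Python's re module (the question does not name an engine, and the variable names are only for illustration):

```python
import re

text = "Nocivic;Voie"

# Plain pattern: the semicolon and "Voie" are consumed and returned.
plain = re.search(r"Nocivic;Voie", text).group()

# Lookahead: ";Voie" must follow, but it is only checked, not consumed.
lookahead = re.search(r"Nocivic(?=;Voie)", text).group()

# Same lookahead on text where ";Voie" does not follow: no match at all.
no_match = re.search(r"Nocivic(?=;Voie)", "Nocivic;Quartier")

print(plain)      # Nocivic;Voie
print(lookahead)  # Nocivic
print(no_match)   # None
```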
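Second one:
(?<=Voie;) is a lookbehind. The syntax is valid in most engines, but at that position the text to the left is ;Voie, not Voie;, so the assertion fails.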

The second one does not work because, after matching Voie, you use (?<=Voie;) to assert that Voie; sits directly to the left of the current position, but you have not matched the semicolon yet.
Note that the lookaround assertions are fruitless in the example, as you are asserting what you are also matching.
If you want to match exactly those 3 words, it does not make sense to use lookarounds.
You can use 3 capture groups:
#(Nocivic);(Voie);(Quartier)#
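To see both points in practice, here is a small sketch assuming Python's re module. The "repaired" pattern is my own illustration of where a lookbehind could sit, not something taken from the answers:

```python
import re

text = "#Nocivic;Voie;Quartier#"

# Capture groups: each word lands in its own group; the # delimiters are
# matched but not captured.
groups = re.search(r"#(Nocivic);(Voie);(Quartier)#", text).groups()

# The failing pattern from the question: once "Voie" is consumed, the five
# characters to the left of the cursor are ";Voie", so (?<=Voie;) cannot succeed.
failing = re.search(r"\bNocivic(?=;Voie);\bVoie(?<=Voie;)\bQuartier", text)

# One possible repair (illustration only): consume the semicolon first,
# then look behind.
repaired = re.search(r"\bNocivic(?=;Voie);Voie;(?<=Voie;)Quartier", text)

print(groups)            # ('Nocivic', 'Voie', 'Quartier')
print(failing)           # None
print(repaired.group())  # Nocivic;Voie;Quartier
```

The repaired version still asserts only what it also matches, so the capture-group pattern remains the simpler choice.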

Related

Find any 4 consecutive characters between two strings

I'm trying to write a regex that would detect if any combination of 4 non-whitespace characters existed between two strings. They will always be separated by a comma. An example:
Labrador, Matador ---> this would match 'ador'.
Mississippi, Missing ---> This would match 'Miss' and 'issi'
Corporate, Corporation ---> This would match 'Corp' , 'orpo' , 'rpor' , 'pora' and 'orat'
It's been pretty hard to find something similar to this, and the closest I've found has said this is not possible in regex. It's definitely tricky, but I wanted to make sure that it was in fact not possible before looking for a different solution.
If it is impossible, would someone explain why?
For overlapping matches it is possible with a lookahead:
/(?=(\S{4}).*,.*\1)/
Note that there is one more issi possible in your second line example.
Test: https://regex101.com/r/rV3gN9/2
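A rough sketch of how the lookahead pattern behaves, assuming Python's re module. The helper name shared_4grams is hypothetical, and the de-duplication step is my addition to collapse the repeated issi:

```python
import re

def shared_4grams(line):
    # Every match is zero-width; group 1 holds the 4 characters seen at that
    # position, provided the same 4 characters reappear after the comma.
    hits = re.findall(r"(?=(\S{4}).*,.*\1)", line)
    # Drop repeats while keeping first-seen order.
    return list(dict.fromkeys(hits))

raw = re.findall(r"(?=(\S{4}).*,.*\1)", "Mississippi, Missing")

print(raw)                                      # ['Miss', 'issi', 'issi']
print(shared_4grams("Mississippi, Missing"))    # ['Miss', 'issi']
print(shared_4grams("Labrador, Matador"))       # ['ador']
print(shared_4grams("Corporate, Corporation"))  # ['Corp', 'orpo', 'rpor', 'pora', 'orat']
```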
You can use this lookahead based regex:
(?=([a-zA-Z]{4})[a-zA-Z]*, *[a-zA-Z]*\1)
Though it will find issi twice since Mississippi has 2 instances of issi.
This can be achieved with backreferences:
\w*([a-zA-Z]{4})\w*, \w*\1\w*
See example: https://regex101.com/r/eW8hB7/1
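For comparison, a short sketch of the backreference pattern, assuming Python's re module and the letter class written as [a-zA-Z]. Because the pattern consumes the whole line, each line yields a single match and therefore a single shared substring rather than all of them:

```python
import re

pattern = re.compile(r"\w*([a-zA-Z]{4})\w*, \w*\1\w*")

m1 = pattern.search("Labrador, Matador")
m2 = pattern.search("Mississippi, Missing")

print(m1.group(1))  # ador
print(m2.group(0))  # Mississippi, Missing
print(m2.group(1))  # issi
```

Here "Miss" is never reported for the second line, which is the trade-off against the lookahead approach.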

RegEx to match acronyms

I am trying to write a regular expression that will match values such as U.S., D.C., U.S.A., etc.
Here is what I have so far -
\b([a-zA-Z]\.){2,}+
Note how this expression matches but does not include the last letter in the acronym.
Can anyone help explain what I am missing here?
SOLUTION
I'm posting the solution here in case this helps anyone.
\b(?:[a-zA-Z]\.){2,}
It seems as if a non-capturing group is required here.
Try (?:[a-zA-Z]\.){2,}
?: (non-capturing group) is there because you want to omit capturing the last iteration of the repeated group.
For example, without ?:, 'U.S.A.' will yield a group match 'A.', which you are not interested in.
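A small sketch of that effect, assuming Python's re module, where findall returns the group instead of the whole match whenever the pattern has one capturing group. The sample sentence is made up, and the possessive + from the question is left out since it does not change this result:

```python
import re

text = "Offices in the U.S.A. and D.C. are open."

# Capturing group: findall reports only the last repetition of the group.
capturing = re.findall(r"\b([a-zA-Z]\.){2,}", text)

# Non-capturing group: findall reports the whole match.
non_capturing = re.findall(r"\b(?:[a-zA-Z]\.){2,}", text)

print(capturing)      # ['A.', 'C.']
print(non_capturing)  # ['U.S.A.', 'D.C.']
```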
None of these proposed solutions do what yours does - make sure that there are at least 2 letters in the acronym. Also, yours works on http://rubular.com/ . This is probably some issue with the regex implementation - to be fair, all of the matches that you got were valid acronyms. To fix this, you could either:
Make sure there's a space or EOF succeeding your expression ((?=\s|$) in ruby at least)
Surround your regex with ^ and $ to make sure it catches the whole string. You'd have to split the whole string on spaces to get matches with this though.
I prefer the former solution - to do this you'd have:
\b([a-zA-Z]\.){2,}(?=\s|$)
Edit: I've realized this doesn't actually work with other punctuation in the string, and a couple of other edge cases. This is super ugly, but I think it should be good enough:
(?<=\s|^)((?:[a-zA-Z]\.){2,})(?=[[:punct:]]?(?:\s|$))
This assumes that you've got this [[:punct:]] character class, and allows for 0-1 punctuation marks after an acronym that won't be captured. I've also fixed it up so that there's a single capture group that gets the whole acronym. Check out validation at http://rubular.com/r/lmr0qERLDh
Bonus: you now get to make this super confusing to anyone reading it.
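If you are not on Ruby, the same idea can be approximated elsewhere. The sketch below is an adaptation for Python's re module, not the pattern above verbatim: Python has no [[:punct:]] class and rejects (?<=\s|^) because the alternatives differ in width, so I assume (?<!\S) and [^\w\s] as stand-ins.

```python
import re

# (?<!\S)   : start of string or preceded by whitespace
# [^\w\s]?  : at most one trailing punctuation mark, left outside the match
ACRONYM = re.compile(r"(?<!\S)((?:[a-zA-Z]\.){2,})(?=[^\w\s]?(?:\s|$))")

text = "She moved from the U.S.A., then to D.C."
found = ACRONYM.findall(text)

print(found)  # ['U.S.A.', 'D.C.']
```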
This should work:
/([a-zA-Z]\.)+/g
I have slightly modified the solution above:
\b(?:[a-zA-Z]+\.){2,}
to enable capturing acronyms containing more than one letter between the dots, like in 'GHQ.AFP.X.Y'

Smallest possible match / nongreedy regex search

I first thought that this answer would totally solve my issue, but it did not.
I have a string url like this one:
http://www.someurl.com/some-text-1-0-1-0-some-other-text.htm#id_76
I would like to extract some-other-text so basically, I come with the following regex:
/0-(.*)\.htm/
Unfortunately, this matches 1-0-some-other-text because regexes are greedy. I cannot make it non-greedy using .*?; it just does not change anything.
I also tried the U modifier, but it did not help.
Why does the "nongreedy" tip not work?
In case you need to get the closest match, you can make use of a tempered greedy token.
0-((?:(?!0-).)*)\.htm
The lazy version of your regex does not work because the regex engine analyzes the string from left to right. It always takes the leftmost position and checks whether it can match there. So, in your case, it found the first 0- and was happy with it. Laziness only affects where the match ends (the rightmost position). In your case there is only one possible end position, so lazy matching could not help achieve the expected result.
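A quick sketch comparing the three variants on the URL from the question, assuming Python's re module (the variable names are only for illustration):

```python
import re

url = "http://www.someurl.com/some-text-1-0-1-0-some-other-text.htm#id_76"

# Greedy and lazy both start at the leftmost "0-" and share the only possible
# end position, so they capture the same text.
greedy = re.search(r"0-(.*)\.htm", url).group(1)
lazy = re.search(r"0-(.*?)\.htm", url).group(1)

# Tempered greedy token: each consumed character must not start another "0-",
# so the match is forced to begin at the last "0-" before ".htm".
tempered = re.search(r"0-((?:(?!0-).)*)\.htm", url).group(1)

print(greedy)    # 1-0-some-other-text
print(lazy)      # 1-0-some-other-text
print(tempered)  # some-other-text
```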
You also can use
0-((?!.*?0-).*)\.htm
It will work if you have individual strings to extract the values from.
You want to exclude the 1-0? If so, you can use a non capturing group:
(?:1-0-)+(.*?)\.htm
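Both this pattern and the negative-lookahead variant just before it can be checked against the URL from the question. A minimal sketch, assuming Python's re module:

```python
import re

url = "http://www.someurl.com/some-text-1-0-1-0-some-other-text.htm#id_76"

# Negative lookahead: only start at a "0-" that has no further "0-" after it
# anywhere in the string, which is why it suits one URL per string.
no_later_zero = re.search(r"0-((?!.*?0-).*)\.htm", url).group(1)

# Non-capturing prefix: swallow every repeated "1-0-" before capturing.
prefix_skipped = re.search(r"(?:1-0-)+(.*?)\.htm", url).group(1)

print(no_later_zero)   # some-other-text
print(prefix_skipped)  # some-other-text
```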

Regular Expressions, getting digit after second occurrence of dot

I want to get the number after the second dot in a string like this:
4.5.3. Some kind of question ? The input string might also look like 41.53.32. Some kind of question ? so I'm aiming for 3 in the first example and 32 in the second.
I'm trying to do it with
(?<=(\.\d\.))[0-9]+
and it works on the first example, but when I try (?<=(\.\d+\.))[0-9]+
it doesn't work at all.
If there is always a dot after the final number then you can use the following expression:
\d+(?=\.(?:[^\d]|$))
This will match one or more digits \d+ which are followed by a dot . and then something that is either not a digit [^\d] or the end-of-string $, i.e. (?=\.(?:[^\d]|$)).
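A short check of this expression against both sample inputs, assuming Python's re module:

```python
import re

pattern = re.compile(r"\d+(?=\.(?:[^\d]|$))")

# Only the last number qualifies: the earlier ones are followed by ".<digit>".
first = pattern.search("4.5.3. Some kind of question ?").group()
second = pattern.search("41.53.32. Some kind of question ?").group()

print(first)   # 3
print(second)  # 32
```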
If you use Perl or PHP, you can try this pattern:
(?:\d+\.){2}\K\d+
The simplest complete answer is probably something like this:
(?<=^(?:[^.]*\.){2})\d+
If you're at all worried about performance, this one will be slightly faster:
^(?:[^.]*\.){2}(\d+)
This one will capture the desired value in capturing group 1.
If you are using an engine that doesn't support variable-length lookbehind, you'll need to use the second version.
If you wish, you can replace [^.] with \d, to only match digits.
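Python's built-in re module is one engine without variable-length lookbehind, so it makes a convenient illustration. The sketch below assumes that module; the flag name lookbehind_supported is just for the example:

```python
import re

# Capture-group version: works everywhere, the value is in group 1.
m = re.search(r"^(?:[^.]*\.){2}(\d+)", "41.53.32. Some kind of question ?")
value = m.group(1)

# Lookbehind version: [^.]* has no fixed width, so Python's re rejects it
# at compile time.
try:
    re.compile(r"(?<=^(?:[^.]*\.){2})\d+")
    lookbehind_supported = True
except re.error:
    lookbehind_supported = False

print(value)                 # 32
print(lookbehind_supported)  # False
```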
(\d+\.\d+\.)\K\d+
Match digits, dot, digits, dot, digits, with the first section grouped and kept out of the returned match by \K.
The following regex should work for your use case:
(?:(?:.*\.)?){2}(\d+)

Regex to first occurrence only? [duplicate]

Let's say I have the following string:
this is a test for the sake of
testing. this is only a test. The end.
and I want to select this is a test and this is only a test. What in the world do I need to do?
The following Regex I tried yields a goofy result:
this(.*)test (I also wanted to capture what was between it)
returns this is a test for the sake of testing. this is only a test
It seems like this is probably something easy I'm forgetting.
The regex is greedy meaning it will capture as many characters as it can which fall into the .* match. To make it non-greedy try:
this(.*?)test
The ? modifier will make it capture as few characters as possible in the match.
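A minimal sketch of the difference, assuming Python's re module and the sample text joined onto one line (by default . does not cross line breaks in Python):

```python
import re

text = "this is a test for the sake of testing. this is only a test. The end."

# Greedy: .* runs to the last "test" in the string.
greedy = re.search(r"this(.*)test", text).group()

# Lazy: .*? stops at the first "test" that lets the pattern succeed.
lazy = [m.group() for m in re.finditer(r"this(.*?)test", text)]
between = re.findall(r"this(.*?)test", text)

print(greedy)   # this is a test for the sake of testing. this is only a test
print(lazy)     # ['this is a test', 'this is only a test']
print(between)  # [' is a ', ' is only a ']
```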
Andy E and Ipsquiggle have the right idea, but I want to point out that you might want to add a word boundary assertion, so that you only deal with the whole words "this" and "test" and not with words that merely contain them. In Perl and similar flavors that's done with the "\b" marker.
As it is, this(.*?)test would match "thistles are the greatest", which you probably don't want.
The pattern you want is something like this: \bthis\b(.*?)\btest\b
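The thistles example can be reproduced like this, assuming Python's re module, which uses the same \b marker:

```python
import re

tricky = "thistles are the greatest"

# Without boundaries, "this" inside "thistles" and "test" inside "greatest" match.
loose = re.search(r"this(.*?)test", tricky)

# With boundaries, neither fragment counts as a whole word.
strict = re.search(r"\bthis\b(.*?)\btest\b", tricky)

# The intended input still matches with boundaries in place.
ok = re.search(r"\bthis\b(.*?)\btest\b", "this is only a test.")

print(loose.group())  # thistles are the greatest
print(strict)         # None
print(ok.group())     # this is only a test
```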
* is a greedy quantifier. That means it matches as much as possible, i.e. what you are seeing. Depending on the specific language support for regex, you will need to find a non-greedy quantifier. Usually this is a trailing question mark, like this: *?. That means it will stop consuming letters as soon as the rest of the regex can be satisfied.
For me, simply removing the /g flag worked.
See https://regex101.com/r/EaIykZ/1