Capturing group for sentences with alternatives - regex

I want to capture these two French sentences.
Je ne pense pas que
Pensez-vous que
Please correct my RE.
(je )? (ne )?(?: pense|pensez)(?: -)?(vous )?(?: pas)? que

Why not something simpler?
/(Je ne pense pas|Pensez-vous) que/
Or, even simpler:
/(Je ne pense pas que|Pensez-vous que)/
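A minimal sketch of how the alternation above could be checked. Python is an assumption here, since the question does not name a regex engine; the IGNORECASE flag and the helper name first_alternative are also additions, not part of the answer.

```python
import re

# Alternation from the answer. IGNORECASE is an added assumption so that
# a lower-case "je ne pense pas que" is accepted as well.
PATTERN = re.compile(r"(Je ne pense pas|Pensez-vous) que", re.IGNORECASE)


def first_alternative(sentence):
    """Return the alternative that matched, or None when nothing matched."""
    match = PATTERN.search(sentence)
    return match.group(1) if match else None


samples = [
    "Je ne pense pas que ce soit vrai.",
    "Pensez-vous que ce soit vrai ?",
    "Je pense que oui.",
]
results = [first_alternative(s) for s in samples]
print(results)
```

Only the first two samples match; the third contains neither of the two fixed phrases.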


Regex multiple detection of same matching group

I am having trouble writing a regex that detects all the characters between occurrences of the keyword "QUESTION".
I want to select every question, but I can't select a question whose keyword was already consumed by the previous match
when I use this regex (the result is bold):
(Question |QUESTION |QCM )(.)*?(Question |QUESTION |QCM )
QUESTION N°%6 : A PROPOS DE LA MYOLOGIE DE L EXTREMITE
CEPHALIQUE :
C. Le nerf facial se termine dans la loge submandibulaire. |
D. Tous les muscles peauciers sont innerves par le nerf facial. . risa baile
E. La contraction du muscle platysma entraine un abaissement de la lévre inférieure.
QUESTION N°7 : A PROPOS DE LA MYOLOGIE DE L'EXTREMITE
CEPHALIQUE : Re ear al 30
A. Le muscle buccinateur est innervé par le nerf mandibulaire. | oe ee
B. La contraction du muscle élévateur nasolabial entraine une constriction de la narine.
QUESTION N°8 : A PROPOS DE L'ARTICULATION TEMPOROMANDIBULAIRE :
A. Les articulations temporo-mandibulaires sont de type sphéroide.
mandibulaire.
Question
I need to match all the questions. Thank you.
You could write the pattern with an assertion and the capture group around the whole matching part.
\b(Question|QUESTION|QCM)\s+(.*?)(?=\s+(?:Question|QUESTION|QCM)\s|$)
Explanation
\b A word boundary
(Question|QUESTION|QCM) Capture any of the alternatives in group 1
\s+ Match 1+ whitespace characters
(.*?) Capture any character in group 2, as few as possible
(?=\s+(?:Question|QUESTION|QCM)\s|$) Assert that to the right is either a new variation of question between whitespace chars, or the end of the string
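A minimal Python sketch of the pattern explained above. Python, the DOTALL flag, and the shortened sample text are assumptions: the answer does not state any flags, but "." has to cross line breaks for a question block that spans several lines.

```python
import re

# Pattern from the answer. re.DOTALL is an added assumption so that "."
# can cross the line breaks inside a question block.
QUESTION_RE = re.compile(
    r"\b(Question|QUESTION|QCM)\s+(.*?)(?=\s+(?:Question|QUESTION|QCM)\s|$)",
    re.DOTALL,
)

# Shortened sample in the shape of the text from the question.
sample = (
    "QUESTION N°6 : A PROPOS DE LA MYOLOGIE\n"
    "C. Le nerf facial se termine dans la loge submandibulaire.\n"
    "QUESTION N°7 : A PROPOS DE LA MYOLOGIE\n"
    "A. Le muscle buccinateur est innervé par le nerf mandibulaire.\n"
    "QUESTION N°8 : A PROPOS DE L'ARTICULATION\n"
    "A. Les articulations temporo-mandibulaires sont de type sphéroide."
)

# The lookahead does not consume the next keyword, so every block is found.
blocks = [(m.group(1), m.group(2)) for m in QUESTION_RE.finditer(sample)]
for keyword, body in blocks:
    print(keyword, "->", body.splitlines()[0])
```

Because the next keyword is only asserted and not consumed, it remains available as the start of the following match, which is what the original pattern was missing.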

Regex to match string starting and between specific word

I have this string :
<a href="/article/aujourd-hui-moment-calin-avec-mon-copain-attache-et-a-4-pattes-il-finis-en-moi-et-recoit-u_267211.html">
Aujourd’hui, moment à la fois câlin et torride avec mon copain. On se fait un petit délire BDSM et, me retrouvant à 4 pattes, il m&apos;attache. Après cette session où on en a fini, il reçoit un appel urgent et part. En me laissant comme ça. VDM
</a>
and I would like to get this one :
Aujourd’hui, moment à la fois câlin et torride avec mon copain. On se fait un petit délire BDSM et, me retrouvant à 4 pattes, il m&apos;attache. Après cette session où on en a fini, il reçoit un appel urgent et part. En me laissant comme ça. VDM
I did some research and succeeded with this regular expression:
[^>]+(?=\<)
The problem is that I have other strings like this:
Aléatoire <span class="rub_icon icon-dice"></span>
With this string, the regex returns Aléatoire, which is not what I want.
So I want to improve the regex to get ONLY the entire sentence that BEGINS with Aujourd’hui.
Does anyone have a solution? I am not used to regex.
In sed, to print only the lines that do not start with a tag, you can use:
sed -n '/^[^<].*$/p' fr.html
Aujourd’hui, moment à la fois câlin et torride avec mon copain. On se fait un petit délire BDSM et, me retrouvant à 4 pattes, il m&apos;attache. Après cette session où on en a fini, il reçoit un appel urgent et part. En me laissant comme ça. VDM
Or you could do the opposite and delete the lines that start with a tag:
sed '/^<.*$/d' fr.html
Aujourd’hui, moment à la fois câlin et torride avec mon copain. On se fait un petit délire BDSM et, me retrouvant à 4 pattes, il m&apos;attache. Après cette session où on en a fini, il reçoit un appel urgent et part. En me laissant comme ça. VDM
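For readers without sed at hand, the same line filter can be sketched in Python. The function name lines_outside_tags and the sample markup are hypothetical; the logic mirrors the first sed command by keeping non-empty lines whose first character is not "<".

```python
def lines_outside_tags(html: str) -> list[str]:
    """Keep non-empty lines that do not start with '<' (like sed -n '/^[^<].*$/p')."""
    return [line for line in html.splitlines() if line and not line.startswith("<")]


sample_html = (
    '<a href="/article/example.html">\n'
    "Aujourd’hui, moment câlin avec mon copain. VDM\n"
    "</a>"
)
kept = lines_outside_tags(sample_html)
print(kept)
```

Note that a line such as Aléatoire <span class="rub_icon icon-dice"></span> would also be kept by this filter, because it does not begin with a tag.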
So, based on your explanation:
>\s?(Aujourd’hui.*?)\s?<
> and < specify that the content sits between tags (outside of the HTML markup)
\s? specifies that there may be, but doesn't have to be, a whitespace character
without:
<a>string</a>
with:
<a>
string
</a>
Aujourd’hui specifies that the match has to start with this word
.*? matches any further characters in the string, as few as possible
I hope the order is obvious.
Edit: to avoid confusion, we are talking about _match functions, with
the full regex being />\s?(Aujourd’hui.*?)\s?</g.
https://regex101.com/r/F0bPWN/2
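A minimal Python sketch of the same pattern, assuming the text uses the typographic apostrophe (’) exactly as in the pattern. Python and the shortened sample markup are assumptions; the answer itself targets a different regex API.

```python
import re

# Pattern from the answer: text between ">" and "<" that starts with "Aujourd’hui".
SENTENCE_RE = re.compile(r">\s?(Aujourd’hui.*?)\s?<")

# Shortened sample in the shape of the markup from the question.
html = (
    '<a href="/article/example.html">\n'
    "Aujourd’hui, moment câlin avec mon copain. VDM\n"
    "</a>\n"
    'Aléatoire <span class="rub_icon icon-dice"></span>'
)

# findall returns the contents of the single capture group for each match.
sentences = SENTENCE_RE.findall(html)
print(sentences)
```

The Aléatoire line is skipped because it does not start with the required word. A sentence that itself contains line breaks would need extra handling, since "." does not match a newline by default.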

regex: how to stop on no alpha or end of line char?

My goal is to match both:
25 place de la paix
24 place de la guerre. Do not continue after .
26 place de la foi !do not continue after !
Should give 3 results:
25 place de la paix
24 place de la guerre
26 place de la foi
I use:
/\d+\splace.*[^a-z\s]/iU
which works fine for
24 place de la guerre.
since it stops at the non-alphabetic character "."
I would like the regex to stop at a non-alphabetic character OR at the end of the line. Any idea?
I tried with
/\d+\splace.*[^a-z\s\n]/iU
/\d+\splace.*[^a-z\s\r]/iU
You don't need to use .* after place. You can just use [a-z\s]* to match what you want:
/\b\d+\s+place[a-z\s]*/i
Or else use a lazy match followed by a lookahead, so that matching stops at the first non-letter, non-space character or at the end of the line:
/\b\d+\s+place.*?(?=[^a-z\s]|$)/mi
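A minimal Python sketch of the lookahead variant, run on the three lines from the question. Python and the strip() call are assumptions added for illustration: the lazy match keeps the space in front of "!", so the sketch trims it.

```python
import re

# Lookahead variant from the answer, with the i and m flags.
ADDRESS_RE = re.compile(
    r"\b\d+\s+place.*?(?=[^a-z\s]|$)",
    re.IGNORECASE | re.MULTILINE,
)

text = (
    "25 place de la paix\n"
    "24 place de la guerre. Do not continue after .\n"
    "26 place de la foi !do not continue after !"
)

# strip() drops the trailing space that the lazy match keeps before "!".
addresses = [m.group(0).strip() for m in ADDRESS_RE.finditer(text)]
print(addresses)
```

With the m flag, $ also matches at the end of each line, which is what stops the first address.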
\s includes spaces, tabs and line breaks. That is why, when you use \s in [^a-z\s], the class also refuses to match a newline. You can use this:
/\d+ place de la \w+/
to match all of these:
25 place de la paix
24 place de la guerre
26 place de la foi
Use a non-capturing group of horizontal whitespace followed by letters:
/\d+\h+place(?:\h+[a-z]+)*/i
Note: most of the time, the U modifier is totally useless.
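The \h class (horizontal whitespace) is not available in every engine. As a sketch under that assumption, Python's re module can approximate it with [ \t]; the rest of the pattern is unchanged.

```python
import re

# [ \t] stands in for \h, which Python's re module does not provide.
ADDRESS_RE = re.compile(r"\d+[ \t]+place(?:[ \t]+[a-z]+)*", re.IGNORECASE)

text = (
    "25 place de la paix\n"
    "24 place de la guerre. Do not continue after .\n"
    "26 place de la foi !do not continue after !"
)

# No capture groups in the pattern, so findall returns the full matches.
addresses = ADDRESS_RE.findall(text)
print(addresses)
```

Each repetition of the group requires whitespace followed by at least one letter, so the match can neither run across a line break nor end with a dangling space.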

Diacritics and regular expressions in R

In R I have a column which should contain only one word. It is created by taking the contents of another column and using a regex to keep only the last word. However, for some rows this doesn't work, in which case R simply copies the content from the first column. Here is my R code:
df$precedingWord <- gsub(".*?\\W*(\\w+-?)\\W*$","\\1", df$leftContext, perl=TRUE)
precedingWord should hold only one word. It is extracted from leftContext with a regex. This works fine overall, but not with diacritics. A couple of rows in leftContext contain letters with diacritics such as é and à. For some reason R ignores these items completely and simply copies the whole string to precedingWord. I find this odd, because it is practically impossible for the regex to match the whole thing, as you can see here. In the example, Test string is leftContext and Substitution should be precedingWord.
As you can see in the example above, the output in the online regex tester differs from the output I get: I simply get an exact copy of leftContext. This does not mean that the output in the online tester is what I want. The tool considers letters with diacritics to be non-word characters and thus does not mark them as the output that I want. I actually want to treat them as word characters so that they are eligible for output.
If this is the input:
Un premier projet prévoit que l'établissement verserait 11 FF par an et par élève du secondaire et 30 FF par étudiant universitaire, une somme à évaluer et à
Outre le prêt-à-
And à
Sur base de ces données, on cherchera à
Ce sera encore le cas ce vendredi 19 juillet dans l'é
Then this is the output I expect
à
prêt-à-
à
à
é
This is the regex I already have
.*?\W*(\w+?-?)\W*$
I'm already using stringi in my project, so if that provides a solution I could use that.
In Perl-like regex, you can match any Unicode letter with the \p{L} shorthand class, and any character that is not a Unicode letter with the reverse class \P{L}. See regular-expressions.info:
You can match a single character belonging to the "letter" category with \p{L}. You can match a single character not belonging to that category with \P{L}.
Thus, the regex you can use is
df$precedingWord <- gsub(".*?\\P{L}*(\\p{L}+-?)\\P{L}*$","\\1", df$leftContext, perl=TRUE)
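The idea can also be tried outside R. The following is only a sketch in Python, whose built-in re module has no \p{L}: the classes [^\W\d_] and [\W\d_] are swapped in as approximations of \p{L} and \P{L}. The names last_word and COMPOUND_RE are hypothetical, and COMPOUND_RE is a variation that is not part of the answer.

```python
import re

LETTER = r"[^\W\d_]"     # approximates \p{L}: word characters minus digits and "_"
NON_LETTER = r"[\W\d_]"  # approximates \P{L}

# Same shape as the pattern in the answer.
LAST_WORD_RE = re.compile(rf".*?{NON_LETTER}*({LETTER}+-?){NON_LETTER}*$")

# Hypothetical variation that keeps hyphenated compounds such as "prêt-à-" together.
COMPOUND_RE = re.compile(
    rf".*?{NON_LETTER}*({LETTER}+(?:-{LETTER}+)*-?){NON_LETTER}*$"
)


def last_word(text, pattern=LAST_WORD_RE):
    """Return the last word of text, or text unchanged when nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else text


for row in ["And à", "Outre le prêt-à-", "Ce sera encore le cas dans l'é"]:
    print(repr(last_word(row)), repr(last_word(row, COMPOUND_RE)))
```

In this approximation the answer-shaped pattern returns à- for the row Outre le prêt-à-, because a hyphen between two letters counts as a non-letter; the variation is one possible way to obtain prêt-à- as listed in the expected output.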

Get shortest match with regex - lazy quantifier

I am trying to extract some strings from a legal text where the patterns are repeated several times.
I am not sure I understand how the lazy quantifier (?) works. From what I read, it is supposed to capture a match using as few characters as possible. However, it doesn't seem to do that in my example below:
Sorry for the text in Spanish, but I guess it is simple enough to follow.
...por la afirmativa.los señores jueces doctores genoud, hitters, de
lazzari, roncoroni y soria, por los mismos fundamentos de la señora
jueza doctora kogan, votaron la primera cuestion planteada tambien por
la negativa.a la tercera cuestion planteada, la señora jueza doctora
kogan dijo:..(text)...voto por la afirmativa.los señores jueces
doctores genoud e hitters, por los mismos fundamentos de la señora
jueza doctora kogan, votaron la tercera cuestion planteada por la
afirmativa.a la tercera cuestion planteada, el señor juez doctor de
lazzari dijo:...
I am trying to capture the text between the strings "los señores jueces" (line 4) and "votaron la tercera cuestion planteada por la afirmativa" . There are two matches for this pattern as the string "los señores jueces" appears twice, once at the beginning and then in line 4.
So I try to use the lazy quantifier (.*?) to get the shortest of the 2 matches:
(los señores jueces(.*?)votaron la tercera cuestion planteada por la afirmativa)
But it doesn't seem to work: it matches the longest string, starting from line 1 and not from the second (shortest) occurrence. I am testing the regex on https://regex101.com/
I appreciate any help with this.
Thanks.
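A lazy quantifier only makes the match end as early as possible; the engine still starts at the leftmost los señores jueces. Use a negative lookahead to force the regex engine to check, before matching each character, that the string los señores jueces does not start there.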
los señores jueces((?:(?!los señores jueces).)*?)votaron la tercera cuestion planteada por la afirmativa
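A minimal Python sketch contrasting the plain lazy pattern with the lookahead version above. Python, the DOTALL flag, and the shortened one-line sample are assumptions made for illustration.

```python
import re

START = "los señores jueces"
END = "votaron la tercera cuestion planteada por la afirmativa"

# Plain lazy version from the question: starts at the leftmost START.
naive_re = re.compile(f"{re.escape(START)}(.*?){re.escape(END)}", re.DOTALL)

# Version from the answer: no character of the captured text may begin START.
tempered_re = re.compile(
    f"{re.escape(START)}((?:(?!{re.escape(START)}).)*?){re.escape(END)}",
    re.DOTALL,
)

# Shortened sample in the shape of the text from the question.
text = (
    "por la afirmativa.los señores jueces doctores genoud y soria, "
    "votaron la primera cuestion planteada tambien por la negativa."
    "a la tercera cuestion planteada, la señora jueza doctora kogan dijo: "
    "voto por la afirmativa.los señores jueces doctores genoud e hitters, "
    "votaron la tercera cuestion planteada por la afirmativa."
)

naive_inner = naive_re.search(text).group(1)
tempered_inner = tempered_re.search(text).group(1)
print(len(naive_inner), repr(tempered_inner))
```

The naive capture still contains the second los señores jueces, while the lookahead version can only succeed from the last occurrence before the closing phrase.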