Diacritics and regular expressions in R - regex

In R, I have a column that should contain only one word. It is created by taking the contents of another column and using a regex to keep only the last word. However, for some rows this doesn't work, in which case R simply copies the content of the first column. Here is my R code:
df$precedingWord <- gsub(".*?\\W*(\\w+-?)\\W*$","\\1", df$leftContext, perl=TRUE)
precedingWord should hold only one word. It is extracted from leftContext with a regex. This works fine overall, but not with diacritics. A couple of rows in leftContext have letters with diacritics such as é and à. For some reason R ignores these items completely and simply copies the whole thing to precedingWord. I find this odd, because it is practically impossible that the regex matches the whole thing - as you can see here. In the example, Test string is leftContext and Substitution should be precedingWord.
As you see in the example above, the output in the online regex tester is different from the output I get: I simply get an exact copy of leftContext. This does not mean that the output in the online tester is what I want. The tool considers letters with diacritics to be non-word characters and thus doesn't mark them as the output that I want. But I actually want to treat them as word characters so that they are eligible for output.
If this is the input:
Un premier projet prévoit que l'établissement verserait 11 FF par an et par élève du secondaire et 30 FF par étudiant universitaire, une somme à évaluer et à
Outre le prêt-à-
And à
Sur base de ces données, on cherchera à
Ce sera encore le cas ce vendredi 19 juillet dans l'é
Then this is the output I expect
à
prêt-à-
à
à
é
This is the regex I already have
.*?\W*(\w+?-?)\W*$
I'm already using stringi in my project, so if that provides a solution I could use that.

In Perl-like regex, you can match any Unicode letter with the \p{L} shorthand class, and any character that is not a Unicode letter with the reverse class \P{L}. See regular-expressions.info:
You can match a single character belonging to the "letter" category with \p{L}. You can match a single character not belonging to that category with \P{L}.
Thus, the regex you can use is
df$precedingWord <- gsub(".*?\\P{L}*(\\p{L}+-?)\\P{L}*$","\\1", df$leftContext, perl=TRUE)
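For readers who want to experiment outside R, here is a minimal Python sketch of the same substitution. Python's built-in re module does not support \p{L}, so the sketch assumes [^\W\d_] as a stand-in for "Unicode letter" and [\W\d_] for "non-letter". The names LETTER, NON_LETTER, last_word and last_hyphenated_word are illustrative and do not come from the answer. With the answer's pattern, a hyphenated tail such as prêt-à- appears to yield only à-; the second function is a hypothetical variation that keeps the whole hyphenated word.

```python
import re

# Assumed stand-ins for \p{L} and \P{L}, which the stdlib re module lacks.
LETTER = r"[^\W\d_]"
NON_LETTER = r"[\W\d_]"

# Same shape as the answer's pattern: .*?\P{L}*(\p{L}+-?)\P{L}*$
ANSWER_PATTERN = re.compile(rf".*?{NON_LETTER}*({LETTER}+-?){NON_LETTER}*$")

# Hypothetical variation: allow letter runs joined by hyphens inside the group.
HYPHEN_PATTERN = re.compile(
    rf".*?{NON_LETTER}*({LETTER}+(?:-{LETTER}+)*-?){NON_LETTER}*$"
)


def last_word(text: str) -> str:
    """Keep only the last letter run (plus an optional trailing hyphen)."""
    return ANSWER_PATTERN.sub(r"\1", text)


def last_hyphenated_word(text: str) -> str:
    """Keep the last word, including hyphen-joined parts such as 'prêt-à-'."""
    return HYPHEN_PATTERN.sub(r"\1", text)


if __name__ == "__main__":
    for row in ["And à", "Outre le prêt-à-", "Ce sera encore le cas dans l'é"]:
        print(repr(last_word(row)), repr(last_hyphenated_word(row)))
```

If a string contains no letters at all, neither pattern matches and the input is returned unchanged, which mirrors what gsub does in R when nothing matches.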

Related

Regex multiple detection of same matching group

I am having trouble with a regex for detecting all characters between occurrences of the keyword "QUESTION".
I want to select every question, but I can't select a question whose keyword is already part of the previous match
when I use this regex (the result is bold):
(Question |QUESTION |QCM )(.)*?(Question |QUESTION |QCM )
QUESTION N°%6 : A PROPOS DE LA MYOLOGIE DE L EXTREMITE
CEPHALIQUE :
C. Le nerf facial se termine dans la loge submandibulaire. |
D. Tous les muscles peauciers sont innerves par le nerf facial. . risa baile
E. La contraction du muscle platysma entraine un abaissement de la lévre inférieure.
QUESTION N°7 : A PROPOS DE LA MYOLOGIE DE L'EXTREMITE
CEPHALIQUE : Re ear al 30
A. Le muscle buccinateur est innervé par le nerf mandibulaire. | oe ee
B. La contraction du muscle élévateur nasolabial entraine une constriction de la narine.
QUESTION N°8 : A PROPOS DE L'ARTICULATION TEMPOROMANDIBULAIRE :
A. Les articulations temporo-mandibulaires sont de type sphéroide.
mandibulaire.
Question
I need to match all questions. Thank you.
You could write the pattern with an assertion and the capture group around the whole matching part.
\b(Question|QUESTION|QCM)\s+(.*?)(?=\s+(?:Question|QUESTION|QCM)\s|$)
Explanation
\b A word boundary
(Question|QUESTION|QCM) Capture any of the alternatives in group 1
\s+ Match 1+ whitespace characters
(.*?) Capture any character in group 2, as few as possible
(?=\s+(?:Question|QUESTION|QCM)\s|$) Assert that to the right is either a new variation of question between whitespace chars, or the end of the string
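As a rough illustration, the pattern can be tried in Python on a shortened version of the sample text. The re.DOTALL flag is an assumption made here so that . can cross line breaks; the answer does not say which flags its demo used. The names QUESTION_RE, sample and questions are illustrative.

```python
import re

# The answer's pattern; re.DOTALL is assumed so that (.*?) can span lines.
QUESTION_RE = re.compile(
    r"\b(Question|QUESTION|QCM)\s+(.*?)(?=\s+(?:Question|QUESTION|QCM)\s|$)",
    re.DOTALL,
)

# Abridged sample based on the text in the question.
sample = (
    "QUESTION N°6 : A PROPOS DE LA MYOLOGIE\n"
    "C. Le nerf facial se termine dans la loge submandibulaire.\n"
    "QUESTION N°7 : A PROPOS DE LA MYOLOGIE\n"
    "A. Le muscle buccinateur est innervé par le nerf mandibulaire.\n"
    "QUESTION N°8 : A PROPOS DE L'ARTICULATION\n"
    "A. Les articulations sont de type sphéroide."
)

# Group 1 is the keyword, group 2 is the body of each question.
questions = [m.group(2) for m in QUESTION_RE.finditer(sample)]

if __name__ == "__main__":
    for body in questions:
        print(repr(body))
```

Because the next keyword is only asserted by the lookahead and not consumed, it stays available as the start of the following match.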

How can I replace wrong spaces in a text using REGEX?

I am trying to figure out how to replace spaces in a text like the example below, but I don't know how to deal with different numbers of spaces in the same text.
This text:
E m  se guida,  a  e mpre sa  deu  ba ixa  e m
cerca  de  $82  b ilhões  ( ma is  de  75 %)  de  se us  a t ivos.
Should be:
Em seguida, a empresa deu baixa em
cerca de $82 bilhões (mais de 75%) de seus ativos.
Note that there are single spaces between characters and double spaces between words.
Could someone shed some light on this?
I would approach this in two steps. First, I would use a regex to replace all of the single spaces, and then another to shorten the double spaces. To find only single spaces, you can use this regex:
(\S)\s(\S)
Next, to find double spaces, you can use this regex:
\s\s+
So first, replace single spaces with groups one and two from the first regex, and then replace double spaces with a single space using the second regex.
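The two steps might look like this in Python. This is a sketch under a few assumptions: the input really has two spaces between words (as the question states), each line is processed on its own, and step 1 uses lookarounds instead of the capture groups. The lookarounds are a variation on the answer's regex, chosen because a character consumed as group 2 cannot serve as group 1 of the next match, so "a t ivos." would become "at ivos." in a single pass with (\S)\s(\S). The function name fix_spacing is illustrative.

```python
import re


def fix_spacing(line: str) -> str:
    """Remove stray single spaces, then collapse double spaces to one."""
    # Step 1: delete a lone space that sits directly between two
    # non-space characters (lookarounds do not consume the neighbours).
    joined = re.sub(r"(?<=\S) (?=\S)", "", line)
    # Step 2: collapse every run of two or more spaces into one space.
    return re.sub(r" {2,}", " ", joined)


# For comparison: the capture-group form misses overlapping cases.
capture_group_result = re.sub(r"(\S)\s(\S)", r"\1\2", "a t ivos.")

if __name__ == "__main__":
    print(fix_spacing("E m  se guida,  a  e mpre sa  deu  ba ixa  e m"))
    print(fix_spacing("cerca  de  $82  b ilhões  ( ma is  de  75 %)  de  se us  a t ivos."))
    print(capture_group_result)
```

A literal space is used rather than \s so that a line break between two lines is never treated as a stray single space.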
Using the Atom editor, you can use these two regexes to find and replace like this:
In the second image, you do have to enter one space; it is slightly unclear from the screenshot. Hope this helps!

regular expression for exact number in a numerical string

I have been trying to create a regular expression for the following problem:
A) PAR
B) 1234
Given strings A and B above, I want to find all matches where those values occur in order, regardless of whitespace etc., with the following important rules:
neither string A nor string B may occur as a substring of a larger string
the given string B must occur after A
the given string B must occur by itself and not be part of another number
Here are some example potential matches:
PAR1234
PAR 1234
PAR 5678, 1234
PAR 9991234999, 1234
PAR !##-= 1234
PAR1234-122
PAR#1234-233
ANY TEXT PAR#1234-233
However, the following should not match:
PART 1234 - PAR is substring of PART
APART 1234 - PAR is substring of APART
PAR 1234999 - 1234 is substring of 1234999
PAR 9991234 - 1234 is substring of 9991234
PAR 9991234999 - 1234 is substring of 9991234999
1234 PAR - 1234 occurs before PAR
Unfortunately, I am trying to do this using REGEXP_LIKE in Oracle, and there is no \b.
I tried
\W*PAR\W*1234
but that won't match #3 in the potential matches above. So I've attempted many variations that work for some cases but not all.
I was wondering if there is an expression that could capture what I am trying to accomplish. Any help would be greatly appreciated.
Thanks.
This solution uses \b to check for a word boundary.
\bPAR1234\b|\bPAR\b.*\b1234\b
See the demo here: https://regex101.com/r/SM8Bq1/2
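The pattern can be checked against the question's examples with a short Python sketch. Since the question states that \b is not available in REGEXP_LIKE, the sketch also includes NO_B_RE, a hypothetical rewrite that emulates the boundaries with \W and anchors. That rewrite is an assumption of this sketch, is only exercised here in Python, and has not been verified against Oracle. All names are illustrative.

```python
import re

# The answer's pattern, which relies on \b.
ANSWER_RE = re.compile(r"\bPAR1234\b|\bPAR\b.*\b1234\b")

# Hypothetical \b-free variation: boundaries emulated with (^|\W) and (\W|$).
NO_B_RE = re.compile(r"(^|\W)PAR(\W(.*\W)?)?1234(\W|$)")

SHOULD_MATCH = [
    "PAR1234",
    "PAR 1234",
    "PAR 5678, 1234",
    "PAR 9991234999, 1234",
    "PAR !##-= 1234",
    "PAR1234-122",
    "PAR#1234-233",
    "ANY TEXT PAR#1234-233",
]

SHOULD_NOT_MATCH = [
    "PART 1234",
    "APART 1234",
    "PAR 1234999",
    "PAR 9991234",
    "PAR 9991234999",
    "1234 PAR",
]


def matches(pattern: re.Pattern, text: str) -> bool:
    """Return True if the pattern is found anywhere in the text."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    for text in SHOULD_MATCH + SHOULD_NOT_MATCH:
        print(f"{text!r}: answer={matches(ANSWER_RE, text)}, no_b={matches(NO_B_RE, text)}")
```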

regex: how to stop on no alpha or end of line char?

My goal is to match all of these:
25 place de la paix
24 place de la guerre. Do not continue after .
26 place de la foi !do not continue after !
Should give 3 results:
25 place de la paix
24 place de la guerre
26 place de la foi
I use:
/\d+\splace.*[^a-z\s]/iU
which works fine for
24 place de la guerre.
since it stops at a non-alphanumeric char ".".
I would like to stop the regex at a non-alpha character OR at the end of the line. Any ideas?
I tried with
/\d+\splace.*[^a-z\s\n]/iU
/\d+\splace.*[^a-z\s\r]/iU
You don't need to use .* after place. You can just use [a-z\s]* to match what you want:
/\b\d+\s+place[a-z\s]*/i
Or else use a lookahead to stop when you encounter the first non-letter, non-space character:
/\b\d+\s+place.*?(?=[^a-z\s]|$)/mi
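Both patterns can be tried in Python on the question's three lines. This is a sketch: the i and m modifiers are mapped to re.IGNORECASE and re.MULTILINE, and the matches are stripped because either pattern can pick up trailing whitespace (for example the space before "!"). The names CLASS_RE, LOOKAHEAD_RE and text are illustrative.

```python
import re

text = (
    "25 place de la paix\n"
    "24 place de la guerre. Do not continue after .\n"
    "26 place de la foi !do not continue after !"
)

# Character-class version: consume only letters and whitespace after "place".
CLASS_RE = re.compile(r"\b\d+\s+place[a-z\s]*", re.IGNORECASE)

# Lookahead version: stop lazily before the first character that is neither
# a letter nor whitespace, or at the end of the line.
LOOKAHEAD_RE = re.compile(
    r"\b\d+\s+place.*?(?=[^a-z\s]|$)", re.IGNORECASE | re.MULTILINE
)

# strip() drops the trailing newline or space that a match may include.
class_hits = [m.group().strip() for m in CLASS_RE.finditer(text)]
lookahead_hits = [m.group().strip() for m in LOOKAHEAD_RE.finditer(text)]

if __name__ == "__main__":
    print(class_hits)
    print(lookahead_hits)
```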
\s includes spaces, tabs and line breaks. That is why, when you used \s in [^a-z\s], line breaks were excluded from the negated class as well, so the match cannot stop at a newline. You can use this:
/\d+ place de la \w+/
to match all of these:
25 place de la paix
24 place de la guerre
26 place de la foi
Use a non-capturing group with spaces followed by alphabetic characters:
/\d+\h+place(?:\h+[a-z]+)*/i
Note: most of the time, the U modifier is totally useless.
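The non-capturing-group pattern can be sketched in Python as well. Python's re module has no \h, so [ \t] is assumed here as a stand-in for horizontal whitespace; the names H, STREET_RE and text are illustrative.

```python
import re

# Assumed stand-in for PCRE's \h (horizontal whitespace).
H = r"[ \t]"

# A number, "place", then any number of "<spaces><letters>" groups.
STREET_RE = re.compile(rf"\d+{H}+place(?:{H}+[a-z]+)*", re.IGNORECASE)

text = (
    "25 place de la paix\n"
    "24 place de la guerre. Do not continue after .\n"
    "26 place de la foi !do not continue after !"
)

# findall returns whole matches because the pattern has no capturing group.
hits = STREET_RE.findall(text)

if __name__ == "__main__":
    print(hits)
```

Because each repetition must end in letters, the match never includes trailing whitespace, so no stripping is needed.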

Regular expression Help for Identifier in Lexical analyser

I am trying to write a lexer. I don't want to use a lex file because I want to learn. To come to the point: I want to write a regular expression for an identifier with the following constraints:
'__' cannot be an identifier.
An underscore must always be accompanied by some letter.
An identifier must contain at least one underscore.
An underscore cannot be the last symbol of an identifier.
It must have one or more digits.
It must not start with a digit.
This is the regular expression I have so far:
([_a-zA-Z]*[a-zA-Z][a-zA-Z]*[_a-zA-Z]*[0-9][0-9]*[_a-zA-Z]*)*
The problem is that I can't express the constraint about 'at least one _' in the identifier. I can't make it more complex because I have to convert this regular expression to a nondeterministic finite automaton, so could you help?
You have specified the following constraints:
may contain letters, digits, and underscores.
must contain at least one digit
must contain at least one underscore
must not end with underscore
must not start with digit
The “at least one X”-type constraints correspond to states in a state machine. Since we have two of these constraints, there are 2*2=4 states that manage whether we still need a digit or an underscore. I'll abbreviate the states:
DU – needs digit, needs underscore
Du – needs digit, has underscore
dU – has digit, needs underscore
du – has digit, has underscore
We can now create a state transition table:
STATE   TRANSITIONS
        _     0-9   a-zA-Z
-----   ---   ---   ------
DU      Du    dU    DU
Du      Du    du    Du
dU      du    dU    dU
du      du    du    du
where DU is the starting state. You have additional special requirements for the first and last state transition. Also, the end state can only be reached from the du state. Actually, du might itself be the end state if it wasn't reached via a _ input. Together with these other requirements, we get the following state transition table. The start state is S, and terminal states are marked with a *. I've left out illegal transitions.
STATE   TRANSITIONS
        _     0-9   a-zA-Z
-----   ---   ---   ------
S       Du    -     DU
DU      Du    dU    DU
Du      Du    du    Du
dU      du'   dU    dU
*du     du'   du    du
du'     du'   du    du
The state du' is the “we've seen everything that we need, but the last symbol was an underscore so we can't end here”-state. This state table does not force any underscore to be followed by a letter, but you should be able to add that yourself using a similar approach. The state table corresponds to a DFA, but I doubt you could simplify it by using an NFA.
We can now translate this state machine to a regular expression, but this is a bit tedious since we have six states (conjecture: the presented state machine is already minimal). By iteratively combining state transitions into regex fragments and eliminating states, we end up with this regex:
([_][_a-zA-Z]*[0-9]|[a-zA-Z][a-zA-Z]*([_][_a-zA-Z]*[0-9]|[0-9][0-9a-zA-Z]*[_]+[0-9a-zA-Z]))[0-9a-zA-Z]*([_]+[0-9a-zA-Z]+)*
Well, unless I made a mistake. Which at this point is pretty likely.
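One way to check for such a mistake is to simulate the six-state table directly and compare it with the regex on every short string. The Python sketch below does that over the alphabet "_", "1", "a" for all strings of up to six characters. The dictionary layout and the names TRANSITIONS, classify, dfa_accepts and disagreements are illustrative, and None stands for an illegal transition.

```python
import re
import string
from itertools import product

# The second state table; "_" = underscore, "d" = digit, "l" = letter.
TRANSITIONS = {
    "S":   {"_": "Du",  "d": None, "l": "DU"},
    "DU":  {"_": "Du",  "d": "dU", "l": "DU"},
    "Du":  {"_": "Du",  "d": "du", "l": "Du"},
    "dU":  {"_": "du'", "d": "dU", "l": "dU"},
    "du":  {"_": "du'", "d": "du", "l": "du"},
    "du'": {"_": "du'", "d": "du", "l": "du"},
}
ACCEPTING = {"du"}

# The regex derived in the answer, copied verbatim.
ANSWER_RE = re.compile(
    r"([_][_a-zA-Z]*[0-9]|[a-zA-Z][a-zA-Z]*([_][_a-zA-Z]*[0-9]"
    r"|[0-9][0-9a-zA-Z]*[_]+[0-9a-zA-Z]))[0-9a-zA-Z]*([_]+[0-9a-zA-Z]+)*"
)


def classify(ch):
    """Map a character to its input class, or None if it is not allowed."""
    if ch == "_":
        return "_"
    if ch in string.digits:
        return "d"
    if ch in string.ascii_letters:
        return "l"
    return None


def dfa_accepts(text):
    """Run the table-driven automaton over the whole string."""
    state = "S"
    for ch in text:
        cls = classify(ch)
        state = TRANSITIONS[state][cls] if cls is not None else None
        if state is None:
            return False
    return state in ACCEPTING


# Collect every short string on which the automaton and the regex disagree.
disagreements = [
    "".join(chars)
    for length in range(0, 7)
    for chars in product("_1a", repeat=length)
    if dfa_accepts("".join(chars)) != (ANSWER_RE.fullmatch("".join(chars)) is not None)
]

if __name__ == "__main__":
    print("disagreements:", disagreements)
```

An empty list only shows agreement on the strings tried; it is a sanity check rather than a proof of equivalence.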
For this kind of validation, it is cumbersome to use regexes. Instead, use a simpler pattern like [_a-zA-Z][_0-9a-zA-Z]+ and check afterwards that the matched string contains a digit and an underscore, and follows the other rules regarding underscores. This works well if the identifiers are somehow delimited in the input language, e.g. by whitespace or other non-identifier characters.
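A minimal Python sketch of that two-stage approach follows. It assumes the token has already been isolated from the input, and it checks only the digit, underscore, first-character and last-character rules; the "underscore always with some letter" rule is left out because the question does not define it precisely. The names CANDIDATE_RE and is_valid_identifier are illustrative.

```python
import re
import string

# Simple shape check: starts with a letter or underscore, at least 2 characters.
CANDIDATE_RE = re.compile(r"[_a-zA-Z][_0-9a-zA-Z]+")


def is_valid_identifier(token: str) -> bool:
    """Check the simple pattern first, then the remaining rules in code."""
    if CANDIDATE_RE.fullmatch(token) is None:
        return False
    has_digit = any(ch in string.digits for ch in token)
    has_underscore = "_" in token
    # An underscore may not be the last symbol.
    return has_digit and has_underscore and not token.endswith("_")


if __name__ == "__main__":
    for token in ["foo_bar2", "a_1", "a1", "a_1_", "1_a", "__"]:
        print(token, is_valid_identifier(token))
```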