Python Regex for preposition in sentence

I have sentences in which the prepositions "di" and "ke" are sometimes fused to the following word, and I want to separate them from that word.
My code so far:
sentence = "kemana dimanake di daladi dipukul ke situ"
regex_patern = r"^(di)|(ke)"
The result I want is:
result= "ke mana di manake di daladi di pukul ke situ"

One option is to match either ke or di and then assert with \B that the position after it is not a word boundary, i.e. that the match is immediately followed by another word character.
(?:ke|di)\B
If ke or di should only be split off at the start of a word, you could prepend a word boundary: \b(?:ke|di)\B.
Then replace with the full match followed by a space:
\g<0>
For example
import re
sentence = "kemana dimanake di daladi dipukul ke situ"
regex_patern = r"(?:ke|di)\B"
print(re.sub(regex_patern, r"\g<0> ", sentence))
Result
ke mana di manake di daladi di pukul ke situ
If you want to make the match a bit broader, you could instead use a positive lookahead (?=\S), which asserts that the next character is a non-whitespace character.
(?:ke|di)(?=\S)
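To see where the three variants differ, here is a small comparison sketch. The helper name split_prefix and the extra inputs "sedikit" and "ke-2" are assumptions added for illustration; they are not part of the question.

```python
import re

# The three patterns discussed above.
ANYWHERE = r"(?:ke|di)\B"        # ke/di followed by another word character
WORD_START = r"\b(?:ke|di)\B"    # same, but only at the start of a word
LOOKAHEAD = r"(?:ke|di)(?=\S)"   # ke/di followed by any non-whitespace character


def split_prefix(pattern, text):
    """Insert a space after every match of pattern (hypothetical helper)."""
    return re.sub(pattern, r"\g<0> ", text)


sentence = "kemana dimanake di daladi dipukul ke situ"
for pattern in (ANYWHERE, WORD_START, LOOKAHEAD):
    print(split_prefix(pattern, sentence))

print(split_prefix(ANYWHERE, "sedikit"))    # also splits inside the word
print(split_prefix(WORD_START, "sedikit"))  # left unchanged
print(split_prefix(LOOKAHEAD, "ke-2"))      # also splits before punctuation
```

On the sample sentence all three patterns give the same result; they only diverge when ke or di occurs inside a word or directly before punctuation.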

Regex - Return only non repeated items

I need to return only non-repeating elements from a string.
This is my current regex:
/{{([1-9]|[1-9][0-9])}}/gm
Here are some samples of what I need:
{{1}}{{2}}{{3}}{{4}}{{5}}{{1}} => {{1}}{{2}}{{3}}{{4}}{{5}}
{{1}}{{2}}{{3}}{{4}}{{2}}{{1}} => {{1}}{{2}}{{3}}{{4}}
{{1}}{{2}}{{2}}{{4}}{{1}}{{1}} => {{1}}{{2}}{{4}}
I have tried the following lookaround:
/{{([1-9]|[1-9][0-9])}}(?!.*\1)/gm
It kind of works for this example, but it keeps the last occurrence instead of the first:
{{1}}{{2}}{{3}}{{4}}{{5}}{{1}} => {{2}}{{3}}{{4}}{{5}}{{1}}
With this example it doesn't work:
Olá, {{1}}
Só pra lembrar, sua consulta foi agendada para o dia {{2}} às {{3}}h.
Por favor, não se atrase
Se por acaso tiver algum imprevisto e não puder comparecer, nos avisem com {{4}}h de antecedência.
{{1}}, te esperamos aqui!
The string above returns all elements:
{{1}}{{2}}{{3}}{{4}}{{1}}
If I understand the problem correctly, here is what you need:
You want to match {{<num>}} patterns
But you don't want to match duplicate occurrences of the same pattern
Duplicates of {{<num>}} can occur anywhere in multiline text
You prefer to match the first occurrence of each duplicate
Since you showed JavaScript syntax for your regex, I am using the lookbehind feature available in modern JavaScript engines:
({{(?:[1-9]|[1-9]\d)}})(?<!^(?:[^]*\1){2})
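RegEx breakdown:
({{(?:[1-9]|[1-9]\d)}}): Match {{<num>}} text and capture it in group #1
(?<!^(?:[^]*\1){2}): Variable-length negative lookbehind that fails the match if the text from the start of the input up to this point contains two occurrences of capture group #1, i.e. if this is not the first occurrence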
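The pattern above depends on JavaScript's variable-length lookbehind, which Python's built-in re module does not support. As a hedged alternative, here is a minimal Python sketch that reaches the same "keep only the first occurrence" result by a different technique: it scans all placeholders and tracks the ones already seen in a set. The function name first_occurrences is an assumption for illustration.

```python
import re

# Same placeholder shape as in the question: {{1}} .. {{99}}
PLACEHOLDER = re.compile(r"\{\{(?:[1-9]|[1-9]\d)\}\}")


def first_occurrences(text):
    """Return each distinct placeholder once, in order of first appearance."""
    seen = set()
    result = []
    for match in PLACEHOLDER.finditer(text):
        token = match.group(0)
        if token not in seen:
            seen.add(token)
            result.append(token)
    return result


message = (
    "Olá, {{1}}\n"
    "Só pra lembrar, sua consulta foi agendada para o dia {{2}} às {{3}}h.\n"
    "Se por acaso tiver algum imprevisto, nos avisem com {{4}}h de antecedência.\n"
    "{{1}}, te esperamos aqui!"
)

print("".join(first_occurrences("{{1}}{{2}}{{2}}{{4}}{{1}}{{1}}")))
print("".join(first_occurrences(message)))
```

Because the scan does not depend on `.` matching newlines, it behaves the same on single-line and multiline input.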

Vi: Substitution pattern SQL file. Issue with the Regex

I have to modify a SQL file with vi to delete columns that we do not use. As we have a lot of data, I use search and replace with a regex pattern.
For instance, we have:
(1,2956,2026442,4,NULL,NULL,'ZAC DU BOIS DES COMMUNES','',NULL,NULL,'Rue DU LUXEMBOURG',NULL,
'9999','EVREUX',NULL,1,'27229',NULL,NULL,NULL,NULL,NULL,' Rue DU LUXEMBOURG, 9999 EVREUX',NULL,NULL,NULL,NULL,
NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,'2020-07-08 16:34:40',NULL,NULL)
So we have 40 columns and I keep 13 of them. The columns I keep are the ones in parentheses:
(1),2,(3),4-5,(6-14),15-22,(23),24-39,(40)
My regex is:
:%s/(\(.\{-}\),.\{-},\(.\{-}\),.\{-},.\{-},\(.\{-},.\{-},.\{-},.\{-},.\{-},.\{-},.\{-},.\{-},.\{-}\),.\{-},
.\{-}, .\{-},.\{-},.\{-},.\{-},.\{-},.\{-},\(.\{-}\),.\{-},.\{-},.\{-},.\{-},.\{-},.\{-},.\{-},.\{-},.\{-},.\{-},.\{-},
.\{-},.\{-},.\{-},.\{-},.\{-},\(.\{-}\))/(\1,\2,\3,\4,\5)/g
I put the parts that interest me in capture groups (they correspond to the values in parentheses on the line above my regex). Then I recover these groups in the replacement.
So normally my result is supposed to be:
(1,2026442,NULL,'ZAC DU BOIS DES COMMUNES','',NULL,NULL,'Rue DU LUXEMBOURG',NULL,
'9999','EVREUX',' Rue DU LUXEMBOURG, 9999 EVREUX',NULL)
But because there is a comma (,) inside ' Rue DU LUXEMBOURG, 9999 EVREUX', my result becomes:
(1,2026442,NULL,'ZAC DU BOIS DES COMMUNES','',NULL,NULL,'Rue DU LUXEMBOURG',NULL,'9999','EVREUX',' Rue DU LUXEMBOURG',NULL,NULL)
Can someone who is good with regex help me? Thanks in advance. If I wasn't clear, tell me and I will try to explain better.
I suggest matching fields that can be strings with a %('[^']*'|\w*) pattern, that is, a non-capturing group that matches either a ' followed by zero or more characters other than ' and a closing ', or zero or more word characters.
Also, non-capturing groups (in Vim, %(...) in very magic mode, or \%(...\) in regular mode) together with very magic mode help shorten the pattern.
The whole pattern will look like
:%s/\v\(([^,]*),[^,]*,([^,]*),[^,]*,[^,]*,(%('[^']*'|\w*)%(,%('[^']*'|\w*)){8})%(,%('[^']*'|\w*)){8},('[^']*'|\w*)%(,%('[^']*'|\w*)){16},([^,]*)\)/(\1,\2,\3,\4,\5)/g
See the regex demo converted to a PCRE regex.
Note that the fields that are never strings are matched with [^,]*, which matches zero or more characters other than a comma. Patterns like %(,%('[^']*'|\w*)){8} match (here) 8 occurrences of a , followed by either a '...' substring or zero or more word characters.

Remove spaces (apostrophes) around quotes with regex in ruby

I'm trying to remove all spaces around quotes with one Ruby regex. (not the same question as this)
Input: l' avant ou l 'après ou encore ' maintenant'
Output: l'avant ou l'après ou encore 'maintenant'
What I tried:
(/'\s|\s'/, '')
It's matching a few cases, but not all.
How can I do this? Thanks.
TLDR:
I assume the spaces were inserted by some automation software and there can only be single spaces around the words.
s = "l' avant ou l 'apres ou encore ' maintenant' ou bien 'ceci ' et ' encore de l ' huile ' d 'accord d' accord d ' accord Je n' en ai pas .... s ' entendre Je m'appelle Victor"
first_rx = /(?<=\b[b-df-hj-np-tv-z]) ' ?(?=\p{L})|(?<=\b[b-df-hj-np-tv-z]) ?' (?=\p{L})/i
# If you find it overmatches, replace [b-df-hj-np-tv-z] with [dlnsmtc],
# i.e. first letters of word that are usually contracted
second_rx = /\b'\b\K|' *((?:\b'\b|[^'])+)(?<=\S) *'/
puts s.gsub(first_rx, "'")
.gsub(second_rx) { $~[1] ? "'#{$~[1]}'" : "" }
Output:
l'avant ou l'apres ou encore 'maintenant' ou bien 'ceci' et 'encore de l'huile' d'accord d'accord d'accord Je n'en ai pas .... s'entendre Je m'appelle Victor
Explanation
The problem is really complex. Several French words are abbreviated and written with an apostrophe, de, le/la, ne, se, me, te, ce to name a few, and the contracted forms are all single consonants. You may remove all spaces between a single, standalone consonant, the apostrophe and the next word using
s.gsub(/(?<=\b[b-df-hj-np-tv-z]) ' ?(?=\p{L})|(?<=\b[b-df-hj-np-tv-z]) ?' (?=\p{L})/i, "'")
If you find it overmatches, replace [b-df-hj-np-tv-z] with [dlnsmtc], i.e. first letters of word that are usually contracted. See the regex demo.
Next step is to remove spaces after initial and before trailing apostrophes. This is tricky:
s.gsub(/\b'\b\K|' *((?:\b'\b|[^'])+)(?<=\S) *'/) { $~[1] ? "'#{$~[1]}'" : "" }
where \b'\b is meant to match all apostrophes in between word chars, i.e. those that were fixed at the previous step. See this regex demo. As there is no (*SKIP)(*F) support in Onigmo regex, the regex is a bit simplified but the replacement is a conditional one: if Group 1 matched, replace with ' + Group 1 value ($1) + ', else, replace with an empty string (since \K reset the match, i.e. dropped all previously matched text from the match buffer).
NOTE: this approach can be extended to handle some specific cases like aujourd'hui, too.
To remove all whitespace around the ', use gsub, applied in several steps for proper whitespace removal. (Chaining gsub! is fragile here: gsub! returns nil when it makes no substitution, so the next call in the chain would raise NoMethodError.)
str = "l' avant ou l 'apres ou encore ' maintenant'"
str = str.gsub(/\b'\s+\b/, "'").gsub(/\b\s+'\b/, "'").gsub(/\b(\s+')\s+\b/, '\1')
puts str
# l'avant ou l'apres ou encore 'maintenant'
Here,
\b : word boundary,
\s+ : 1 or more whitespace characters,
string.gsub(regex, replacement_string) : returns a copy of the string in which every match of regex is replaced with replacement_string (the original string is not modified, hence the assignment back to str),
\1 : in the replacement string, this refers to the first group captured in parentheses in the regex: (...).
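For comparison, the same three passes can be sketched in Python, where re.sub always returns a string, so the passes can be applied one after another without special care. This is a port for illustration only; the function name tighten_apostrophes is an assumption.

```python
import re


def tighten_apostrophes(text):
    """Apply the three whitespace-removal passes in order."""
    text = re.sub(r"\b'\s+\b", "'", text)         # l' avant      -> l'avant
    text = re.sub(r"\b\s+'\b", "'", text)         # l 'après      -> l'après
    text = re.sub(r"\b(\s+')\s+\b", r"\1", text)  # ' maintenant' -> 'maintenant'
    return text


print(tighten_apostrophes("l' avant ou l 'après ou encore ' maintenant'"))
```

Like the original, this only handles the three spacing shapes shown in the question's input.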
So if you have a lot of data like this, all the answers I have seen are wrong and will not work. No regex can guess whether the preceding word should have a space or not, unless you come up with a list of words (or patterns) that either do or don't.
The problem is that sometimes a space should be left and sometimes not. The only way to script that is to find a pattern which describes when the space should be there and when not. You must teach your regex French grammar. It may be possible, but it is probably difficult.
If this is a one-off, my advice is to create regexes for 2 or 3 different situations and use something like vim to go through the data, manually confirming or rejecting each substitution.
There may be some cases you can run automatically, e.g. removing all spaces to the right of quotes, but unfortunately I don't think you can automate the whole process.
I believe the following should work for you
s.gsub(/'.*?'/){ |e| "'#{e[1...-1].strip}'" }
The regex lazily matches all text within single quotes (including the quotes). Then, for each match, the block strips leading and trailing whitespace from the quoted text and returns it wrapped in quotes again.
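A hedged Python port of this last approach makes its behavior easy to check. Note that it pairs up apostrophes from left to right, so on the question's input the first "quoted" span is actually ' avant ou l '; the function name strip_inside_quotes is an assumption for illustration.

```python
import re


def strip_inside_quotes(text):
    """Strip whitespace just inside each left-to-right pair of single quotes."""
    return re.sub(r"'(.*?)'", lambda m: "'" + m.group(1).strip() + "'", text)


print(strip_inside_quotes("l' avant ou l 'après ou encore ' maintenant'"))
print(strip_inside_quotes("l' avant"))  # a lone apostrophe has no partner
```

The second call shows the limitation: with an odd number of apostrophes, the unpaired one is left untouched.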

How to match different groups in regex

I have the following string:
"Josua de Grave* (1643-1712)"
Everything before the * is the person's name, the first date 1643 is his birth date, 1712 is the date of his death.
Following this logic I'd like to have 3 match groups, one for each item. I tried
([a-zA-Z|\s]*)\* (\d{3,4})-(\d{3,4})
"Josua de Grave* (1643-1712)".match(/([a-zA-Z|\s]*)\* (\d{3,4})-(\d{3,4})/)
but that returns nil.
Why is my logic wrong, and what should I do to get the 3 intended match groups?
The literal parentheses ( ) around the 1643-1712 values need to be matched by your regex pattern, so use
([a-zA-Z\s]*)\* \((\d{3,4})-(\d{3,4})\)
//              ^^                   ^^
Since parentheses normally denote a capture group, escape them with \ to match them as literal characters. The | was also dropped from the character class, where it would match a literal | rather than act as alternation.
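The difference between the two patterns can be checked quickly. The sketch below uses Python's re module rather than Ruby, purely as an illustration; the variable names are assumptions.

```python
import re

s = "Josua de Grave* (1643-1712)"

# Pattern from the question: the literal "(" after "* " is never matched.
original = r"([a-zA-Z|\s]*)\* (\d{3,4})-(\d{3,4})"
# Corrected pattern: the literal parentheses are escaped.
fixed = r"([a-zA-Z\s]*)\* \((\d{3,4})-(\d{3,4})\)"

print(re.search(original, s))      # no match
print(re.match(fixed, s).groups())
```

With the escaped parentheses the three groups are the name, the birth year and the death year.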
While you can use a pattern, the problem of splitting this into its parts can also be easily done using other Ruby methods:
Using split:
s = "Josua de Grave* (1643-1712)"
name, dates = s.split('*') # => ["Josua de Grave", " (1643-1712)"]
birth, death = dates[2..-2].split('-') # => ["1643", "1712"]
Or, using scan:
*name, birth, death = s.scan(/[[:alnum:]]+/) # => ["Josua", "de", "Grave", "1643", "1712"]
name.join(' ') # => "Josua de Grave"
birth # => "1643"
death # => "1712"
If I was using a pattern, I'd use this:
name, birth, death = /^([^*]+).+?(\d+)-(\d+)/.match(s)[1..3] # => ["Josua de Grave", "1643", "1712"]
name # => "Josua de Grave"
birth # => "1643"
death # => "1712"
/^([^*]+).+?(\d+)-(\d+)/ means:
^ anchor at the start of the line
([^*]+) capture everything up to, but not including, the first *
.+? skip as few characters as possible until...
(\d+) the birth year is matched and captured
- a literal hyphen is matched but not captured
(\d+) the death year is matched and captured
Regexper helps explain it as does Rubular.
r = /\*\s+\(|(?<=\d)\s*-\s*|\)/
"Josua de Grave* (1643-1712)".split r
#=> ["Josua de Grave", "1643", "1712"]
"Sir Winston Leonard Spencer-Churchill* (1874 - 1965)".split r
#=> ["Sir Winston Leonard Spencer-Churchill", "1874", "1965"]
The regular expression can be made self-documenting by writing it in free-spacing mode:
r = /
\*\s+\( # match '*' then >= 1 whitespaces then '('
| # or
(?<=\d) # match is preceded by a digit (positive lookbehind)
\s*-\s* # match >= 0 whitespaces then '-' then >= 0 whitespaces
| # or
\) # match ')'
/x # free-spacing regex definition mode
The positive lookbehind is needed to avoid splitting hyphenated names on hyphens. (The positive lookahead (?=\d), placed after \s*-\s*, could be used instead.)
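The same separator regex can be tried in Python as a rough port. One difference worth knowing: Python's re.split keeps the empty string that follows the final ")", whereas Ruby's split drops trailing empty strings, so the sketch filters it out. The names SEPARATORS and split_person are assumptions for illustration.

```python
import re

SEPARATORS = re.compile(
    r"""
      \*\s+\(          # '*' then one or more whitespace characters then '('
    | (?<=\d)\s*-\s*   # '-' preceded by a digit, with optional whitespace
    | \)               # ')'
    """,
    re.VERBOSE,
)


def split_person(text):
    """Split 'Name* (birth-death)' into [name, birth, death]."""
    # re.split leaves an empty string after the closing ')'; drop empty parts.
    return [part for part in SEPARATORS.split(text) if part]


print(split_person("Josua de Grave* (1643-1712)"))
print(split_person("Sir Winston Leonard Spencer-Churchill* (1874 - 1965)"))
```

The hyphen in Spencer-Churchill is not a split point because it is not preceded by a digit.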

Why does my regular expression select everything?

Hey guys, I'm trying to select a specific string out of a text, but I'm not a master of regular expressions.
I tried one way, and it starts from the string I want but it matches everything after what I want too.
My regex:
\nSCR((?s).*)(GI|SI)(.*?)\n
Text I'm matching on.
Hierbij een test
SCR
S09
/vince#test.be
05FEB
GI BRGDS OPS
middle text string (may not selected)
SCR
S09
05FEB
LHR
NPVT700 PVT701 30MAR30MAR 1000000 005CRJ FAB1900 07301NCE DD
/ RE.GBFLY/
GI BRGDS
The middle string is selected as well; the match should only run from SCR to the GI line.
Make the first quantifier non-greedy as well:
\nSCR((?s).*?)(GI|SI)(.*?)\n
Or you could use a negative look-ahead assertion (?!expr) to capture just those lines that do not start with either GI or SI:
\nSCR((?:\n(?!GI|SI).*)*)\n(?:GI|SI).*\n
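To make the greedy versus non-greedy difference concrete, here is a hedged Python sketch run against the sample text from the question. Recent Python versions reject a global (?s) in the middle of a pattern, so the sketch uses the scoped form (?s:...) instead; the names TEXT, GREEDY and LAZY are assumptions for illustration.

```python
import re

TEXT = (
    "Hierbij een test\n"
    "SCR\n"
    "S09\n"
    "/vince#test.be\n"
    "05FEB\n"
    "GI BRGDS OPS\n"
    "middle text string (may not selected)\n"
    "SCR\n"
    "S09\n"
    "05FEB\n"
    "LHR\n"
    "NPVT700 PVT701 30MAR30MAR 1000000 005CRJ FAB1900 07301NCE DD\n"
    "/ RE.GBFLY/\n"
    "GI BRGDS\n"
)

# (?s:...) lets "." match newlines only inside that group (Python 3.6+).
GREEDY = re.compile(r"\nSCR(?s:(.*))(GI|SI)(.*?)\n")
LAZY = re.compile(r"\nSCR(?s:(.*?))(GI|SI)(.*?)\n")

greedy_matches = [m.group(0) for m in GREEDY.finditer(TEXT)]
lazy_matches = [m.group(0) for m in LAZY.finditer(TEXT)]

print(len(greedy_matches))  # one match spanning both blocks
print(len(lazy_matches))    # one match per SCR block
```

The greedy version runs to the last GI in the text and therefore swallows the middle line, while the lazy version stops at the first GI after each SCR.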
To match from a line starting with SCR to a line starting with GI or SI (inclusive), you would use the following regular expression:
(?m:^SCR\n(?:^(?!GI|SI).*\n)*(?:GI|SI).*)
This will:
Find the start of a line.
Match SCR and a newline.
Match all following lines that do not start with GI or SI.
Match the last line, requiring it to start with GI or SI (this prevents it from matching to the end of the string if there is no GI or SI line).
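The steps above can be sketched in Python as follows, using the re.MULTILINE flag in place of the inline (?m:...) group. The sample text is the one from the question; the names TEXT and BLOCK are assumptions for illustration.

```python
import re

TEXT = (
    "Hierbij een test\n"
    "SCR\n"
    "S09\n"
    "/vince#test.be\n"
    "05FEB\n"
    "GI BRGDS OPS\n"
    "middle text string (may not selected)\n"
    "SCR\n"
    "S09\n"
    "05FEB\n"
    "LHR\n"
    "NPVT700 PVT701 30MAR30MAR 1000000 005CRJ FAB1900 07301NCE DD\n"
    "/ RE.GBFLY/\n"
    "GI BRGDS\n"
)

# ^ matches at every line start because of re.MULTILINE.
BLOCK = re.compile(r"^SCR\n(?:^(?!GI|SI).*\n)*(?:GI|SI).*", re.MULTILINE)

blocks = BLOCK.findall(TEXT)
for block in blocks:
    print(block)
    print("---")

# Without a closing GI/SI line there is no match at all.
print(BLOCK.findall("SCR\nfoo\nbar\n"))
```

Since the pattern has no capturing groups, findall returns each whole SCR-to-GI block as one string.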