Python regex match across multiple lines - regex

I am trying to match a regex pattern across multiple lines. The pattern begins and ends with a substring, both of which must be at the beginning of a line. I can match across lines, but I can't seem to specify that the end pattern must also be at the beginning of a line.
Example string:
Example=N ; Comment Line One error=
; Comment Line Two.
Desired=
I am trying to match from Example= up to Desired=. This works if error= is not in the string. However, when it is present, the match stops early at Example=N ; Comment Line One error=
config_value = 'Example'
pattern = '^{}=(.*?)([A-Za-z]=)'.format(config_value)
match = re.search(pattern, string, re.M | re.DOTALL)
I also tried:
config_value = 'Example'
pattern = '^{}=(.*?)(^[A-Za-z]=)'.format(config_value)
match = re.search(pattern, string, re.M | re.DOTALL)

You may use
import re

s = "Example=N ; Comment Line One error=\n; Comment Line Two.\nDesired="
config_value = 'Example'
pattern = r'(?sm)^{}=(.*?)(?=[\r\n]+\w+=|\Z)'.format(config_value)
match = re.search(pattern, s)
if match:
    print(match.group(1))
See the Python demo.
Pattern details
(?sm) - re.DOTALL and re.M are on
^ - start of a line
Example= - a substring
(.*?) - Group 1: any 0+ chars, as few as possible
(?=[\r\n]+\w+=|\Z) - a positive lookahead that requires the presence of 1+ CR or LF symbols followed by 1 or more word chars and a = sign, or the end of the string (\Z).
See the regex demo.
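For comparison, the second attempt in the question fails only because `[A-Za-z]=` allows a single letter before the `=`, so `^[A-Za-z]=` can never match `Desired=`. A minimal sketch, assuming the sample string from the question and an illustrative variable name `text`, showing that adding a `+` quantifier makes the line-anchored end pattern work:

```python
import re

# Sample text from the question; `text` is a name chosen for this sketch.
text = "Example=N ; Comment Line One error=\n; Comment Line Two.\nDesired="
config_value = "Example"

# Original second attempt: [A-Za-z] matches exactly one letter, so "Desired=" never fits.
broken = r"^{}=(.*?)(^[A-Za-z]=)".format(re.escape(config_value))
# Same pattern with a + quantifier so a whole key name can precede the "=".
fixed = r"^{}=(.*?)(^[A-Za-z]+=)".format(re.escape(config_value))

flags = re.M | re.DOTALL
broken_match = re.search(broken, text, flags)
fixed_match = re.search(fixed, text, flags)

print(broken_match)                # no match is found
print(repr(fixed_match.group(1)))  # text between "Example=" and the "Desired=" line
print(fixed_match.group(2))        # the end marker itself
```

Unlike the lookahead version above, this variant consumes the end marker and returns it in group 2.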

Related

Replace N spaces at the beginning of a line with N characters

I am looking for a regex substitution to transform N white spaces at the beginning of a line to N &nbsp;. So this text:
list:
- first
should become:
list:
- first
I have tried:
str = "list:\n - first"
str.gsub(/(?<=^) */, "&nbsp;")
which returns:
list:
- first
which is missing one &nbsp;. How can I improve the substitution to get the desired output?
You could make use of the \G anchor and \K to reset the starting point of the reported match.
To match all leading single spaces:
/(?:\R\K|\G) /
(?: Non capture group
\R\K Match a newline and clear the match buffer
| Or
\G Assert the position at the end of the previous match
) Close non capture group and match a space
See a regex demo and a Ruby demo.
To match only the single leading spaces in the example string:
/(?:^.*:\R|\G)\K /
In parts, the pattern matches:
(?: Non capture group
^.*:\R Match a line that ends with : and match a newline
| Or
\G Assert the position at the end of the previous match, or at the start of the string
) Close non capture group
\K Forget what is matched so far and match a space
See a regex demo and a Ruby demo.
Example
re = /(?:^.*:\R|\G)\K /
str = 'list:
- first'
result = str.gsub(re, '&nbsp;')
puts result
Output
list:
- first
I would write
"list:\n - first".gsub(/^ +/) { |s| '&nbsp;' * s.size }
#=> "list:\n&nbsp;- first"
See String#*
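The block-based approach above has a close counterpart in Python, whose standard `re` module supports neither `\G` nor `\K`. A minimal sketch, assuming the replacement token is the HTML entity `&nbsp;` and using a hypothetical helper name `indent_to_nbsp`:

```python
import re

def indent_to_nbsp(text, token="&nbsp;"):
    """Replace each leading space on every line with `token` (assumed to be &nbsp;)."""
    # ^ with re.M anchors at the start of every line; " +" grabs the whole run of
    # leading spaces, and the callback emits one token per matched space.
    return re.sub(r"^ +", lambda m: token * len(m.group()), text, flags=re.M)

print(indent_to_nbsp("list:\n  - first"))
```

Spaces that are not at the start of a line are left untouched because the pattern is anchored with `^`.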
Use gsub with a callback function:
str = "list:\n - first"
output = str.gsub(/(?<=^|\n)[ ]+/) {|m| m.gsub(" ", "&nbsp;") }
This prints:
list:
- first
The pattern (?<=^|\n)[ ]+ matches one or more spaces at the start of a line. This match then gets passed to the callback, which replaces each space, one at a time, with &nbsp;.
You can use a short /(?:\G|^) / regex with a plain text replacement pattern:
result = text.gsub(/(?:\G|^) /, '&nbsp;')
See the regex demo. Details:
(?:\G|^) - start of a line or string or the end of the previous match
" " - a literal space.
See a Ruby demo:
str = "list:\n - first"
result = str.gsub(/(?:\G|^) /, '&nbsp;')
puts result
# =>
# list:
# - first
If you need to match any whitespace, replace the literal space with a \s pattern. Or use \h if you only need to match horizontal whitespace.

Regular Expression to match first word with a character in each line

I am trying to write a regex that finds the first word in each line that contains the character a.
For a string like:
The cat ate the dog
and the mouse
The expression should find cat and
So far, I have:
/\b\w*a\w*\b/g
However this will return every match in each line, not just the first match (cat ate and).
What is the easiest way to only return the first occurrence?
Assuming you are only looking for words without digits and underscores (\w would include those), I'd suggest:
(?i)^.*?(?<!\S)([b-z]*a[a-z]*)(?!\S)
And use whatever is in the 1st capture group. See an online demo. Or, if supported:
(?i)^.*?\K(?<!\S)[b-z]*a[a-z]*(?!\S)
See an online demo.
Please note that I used lookarounds to assert that the word is delimited by whitespace (or by the start or end of the string). You may also use word boundaries if you prefer and swap those lookarounds for \b. Also, depending on your application, you can probably replace the inline case-insensitive switch with a flag. For example, if you happen to use JavaScript, /^.*?(?<!\S)([b-z]*a[a-z]*)(?!\S)/gmi should probably be your option. See for example:
var myString = "The cat ate the dog\nand the mouse";
// A regex literal keeps \S intact; inside a string literal "\S" would collapse to "S".
var myRegexp = /^.*?(?<!\S)([b-z]*a[a-z]*)(?!\S)/gmi;
var m = myRegexp.exec(myString);
while (m != null) {
  console.log(m[1]);
  m = myRegexp.exec(myString);
}
If you want to match a word using \w you might also use a negated character class matching any character except a or a newline.
Then match a word that consists of at least an a char with word boundaries \b
^[^a\n\r]*\b([^\Wa]*a\w*)
The pattern matches:
^ Start of a line (with the multiline flag enabled)
[^a\n\r]*\b Optionally match any characters except a or a newline, then assert a word boundary
( Capture group 1
[^\Wa]*a\w* Optionally match a word character without a, then match a and optional word characters
) Close group 1
Regex demo
Using whitespace boundaries on the left and right:
^[^a\n\r]*(?<!\S)([^\Wa]*a\w*)(?!\S)
Regex demo
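As a rough illustration in Python, the word-boundary pattern above can be applied per line with `re.M`. This is a sketch, assuming the sample text from the question; the variable names are chosen for illustration only:

```python
import re

text = "The cat ate the dog\nand the mouse"

# With re.M, ^ anchors at every line start. [^a\n\r]* skips ahead without
# crossing an "a" or a line break, and group 1 captures the first word with an "a".
pattern = re.compile(r"^[^a\n\r]*\b([^\Wa]*a\w*)", re.M)

first_words = pattern.findall(text)
print(first_words)
```

Because the pattern has a single capture group, `findall` returns only the captured words, one per line that contains an `a`.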
The text could be matched with the regular expression
(?=(\b[a-z]*a[a-z]*\b)).*\r?\n
with the multiline and case-insensitive flags set. For each match, capture group 1 contains the first word (composed only of letters) in a line that contains an "a". There are no matches in lines that do not contain an "a". Note that because the pattern ends with \r?\n, a final line without a line terminator is not matched.
Demo
The expression can be broken down as follows.
(?=                    # begin a positive lookahead
  (\b[a-z]*a[a-z]*\b)  # match a word containing an "a" and save it to capture group 1
)                      # end the positive lookahead
.*\r?\n                # match the remainder of the line including the line terminator

Get the first occurrence of a string in a variable REGEX

I have the following variable in a database: PSC-CAMPO-GRANDE-I08-V00-C09-H09-IPRMKT and I want to split it into two variables, the first will be PSC-CAMPO-GRANDE-I08 and the second V00-C09-H09-IPRMKT.
I tried the regex .*(\-I).*(\-V), but this doesn't work. Then I tried .*(\-I), but it matches up to the last -I, the one in -IPRMKT.
So my question is: is there a way to split the string PSC-CAMPO-GRANDE-I08-V00-C09-H09-IPRMKT at the first occurrence of -I?
This should do the trick:
regex = r"(.*?-I[\d]{2})-(.*)"
Here is test script in Python
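import re

# Raw string so that \d reaches the regex engine unchanged.
regex = r"(.*?-I[\d]{2})-(.*)"
match = re.search(regex, "PSC-CAMPO-GRANDE-I08-V00-C09-H09-IPRMKT")
if match:
    print("yep")
    print(match.group(1))  # PSC-CAMPO-GRANDE-I08
    print(match.group(2))  # V00-C09-H09-IPRMKT
else:
    print("nope")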
In the regex, I'm grabbing everything up to the first -I followed by two digits. Then I match but don't capture a -, and capture the rest. I can help tweak it if there is more logic you are trying to handle.
You may use
^(.*?-I[^-]*)-(.*)
See the regex demo
Details:
^ - start of a string
(.*?-I[^-]*) - Group 1:
.*? - any 0+ chars other than line break chars, as few as possible (because *? is a lazy quantifier, the match extends only up to the first occurrence of what follows)
-I - a literal substring -I
[^-]* - any 0+ chars other than a hyphen (your pattern was missing it)
- - a hyphen
(.*) - Group 2: any 0+ chars other than line break chars up to the end of a line.
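To see why the lazy quantifier matters here, the following sketch compares the asker's greedy attempt with the pattern above in Python. The variable names are illustrative assumptions, not part of the original answers:

```python
import re

value = "PSC-CAMPO-GRANDE-I08-V00-C09-H09-IPRMKT"

# Greedy .* runs to the end of the string and backtracks to the LAST "-I".
greedy = re.match(r".*(-I)", value)
print(greedy.group(0))

# Lazy .*? stops at the FIRST "-I"; [^-]* then takes the rest of that segment.
lazy = re.match(r"^(.*?-I[^-]*)-(.*)", value)
first, second = lazy.groups()
print(first)
print(second)
```

The greedy match ends inside `-IPRMKT`, which is the behaviour described in the question, while the lazy version splits at `-I08`.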

Regex that only matches when no duplicate lines are found

I have a multiline string like this:
SA21 abcdef
BKxyz
SA21 abcdef
I need a regex that only matches if the line ^SA21 abcdef$ is present once. So it should not match for the first example but it should match for this one:
BK udsia
SA21 abcdef
BKxyz
I tried to capture the line and make sure it matches only when the same line is not found later: /(^SA21 abcdef$)(?!\1)/m regex101 but that does not work as it will probably always match the last line...
The regex should match only if the target line occurs exactly once, that is, if it appears neither before nor after that single occurrence. This can be achieved with a tempered greedy token:
/\A(?:(?!^SA21 abcdef$).)*(^SA21 abcdef$)(?:(?!^SA21 abcdef$).)*\z/ms
See the regex demo
The (?:(?!^SA21 abcdef$).)* is the token matching any text but the beginning of the SA21 abcdef line. The /s modifier is required so that a . could match a newline.
However, the construct is resource consuming, and it is a good idea to unroll it:
/\A(?:(?!SA21 abcdef$).*\n+)*(SA21 abcdef)$(?:\n+(?!SA21 abcdef$).*)*\z/m
See another demo
Note that \A and \z are unambiguous start/end string anchors, the /m modifier does not affect them.
Pattern explanation:
\A - start of string
(?:(?!SA21 abcdef$).*\n+)* - zero or more sequences of:
(?!SA21 abcdef$) - a line start not followed with SA21 abcdef that is the whole line
.* - zero or more chars other than a newline (the rest of that line)
\n+ - 1 or more newlines
(SA21 abcdef) - the line that must be single
$ - end of line
(?:\n+(?!SA21 abcdef$).*)* - zero or more sequences of:
\n+ - 1 or more newlines ...
(?!SA21 abcdef$) - not followed with SA21 abcdef that is the whole line
.* - zero or more chars other than a newline
\z - end of string.
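As a rough Python counterpart of the tempered greedy token pattern, the sketch below uses `\Z` (Python's end-of-string anchor, where the patterns above use `\z`) together with `re.M | re.S`. The helper name `occurs_once` is a hypothetical choice for this example:

```python
import re

LINE = r"^SA21 abcdef$"
# Any char that does not start the target line, repeated, on both sides of
# exactly one occurrence of the target line.
ONCE = re.compile(
    r"\A(?:(?!{0}).)*({0})(?:(?!{0}).)*\Z".format(LINE),
    re.M | re.S,
)

def occurs_once(text):
    """Return True when the line "SA21 abcdef" appears exactly once in text."""
    return ONCE.search(text) is not None

print(occurs_once("SA21 abcdef\nBKxyz\nSA21 abcdef"))
print(occurs_once("BK udsia\nSA21 abcdef\nBKxyz"))
```

When no regex is strictly required, counting the matching lines with `text.splitlines().count("SA21 abcdef") == 1` expresses the same check more directly.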

Added some regex into existing regular pattern

I am not good at regex and need to update the following pattern without affecting what it already matches. Any suggestions? The line contains 1 to 4 $ signs. The $ signs are always at the beginning of the line (possibly preceded by spaces).
import re
data = " $$$AKL_M0_90_2K: Two line end vias (VIAG, VIAT and/or"
patt = '^ (?:ABC *)?([A-Za-z0-9/\._\:]+)\s*: ? '
match = re.findall( patt, data, re.M )
print(match)
Note: data is a multi-line string.
The match should contain this result: "$$$AKL_M0_90_2K"
I suggest the following solution (see IDEONE demo):
import re
data = r" $$$AKL_M0_90_2K: Two line end vias (VIAG, VIAT and/or"
patt = r'^\s*([$]{1,4}[^:]+)'
match = re.findall( patt, data, re.M )
print(match)
The re.findall will return the list with just one match. The ^\s*([$]{1,4}[^:]+) regex matches:
^ - start of a line (you use re.M)
\s* - zero or more whitespaces
([$]{1,4}[^:]+) - Group 1 capturing 1 to 4 $ symbols, and then one or more characters other than :.
See the regex demo
If you need to keep your own regex, just do one of the following:
Add $ to the character class (demo): ^ (?:ABC *)?([$A-Za-z0-9/._:]+)\s*: ?
Add an alternative to the first non-capturing group and place it at the start of the capturing one (demo): ^ ((?:ABC *|[$]{1,4})?[A-Za-z0-9/._:]+)\s*: ?
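To check that both variants keep the behaviour of the existing pattern while also accepting the `$` prefix, they can be run side by side. A minimal sketch, assuming hypothetical multi-line sample data (only the first line comes from the question; the other lines and the variable names are made up for illustration):

```python
import re

# First line is from the question; the other two are made-up sample lines.
data = (
    " $$$AKL_M0_90_2K: Two line end vias (VIAG, VIAT and/or\n"
    " RULE_1: some other check\n"
    " $$X.2: third check"
)

dollar_only = r"^\s*([$]{1,4}[^:]+)"
class_variant = r"^ (?:ABC *)?([$A-Za-z0-9/._:]+)\s*: ?"
group_variant = r"^ ((?:ABC *|[$]{1,4})?[A-Za-z0-9/._:]+)\s*: ?"

results = {
    name: re.findall(patt, data, re.M)
    for name, patt in [
        ("dollar_only", dollar_only),
        ("class_variant", class_variant),
        ("group_variant", group_variant),
    ]
}
for name, found in results.items():
    print(name, found)
```

Note that the character-class variant accepts `$` anywhere in the name and in any quantity, whereas the alternation variant restricts it to a prefix of 1 to 4 `$` signs.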