Clarification of Python regular expression - regex

I am a little confused about how regular expressions and sub work in Python.
I have this example:
nw = " textttt "
nw = re.sub(r'\s+(textttt)\s+', r'\1 ', nw)
The value in nw will be nw = "textttt ".
However if I have:
nw = " textttt "
nw = re.sub(r'\s(textttt)\s', r'\1 ', nw)
The value of nw will be nw = " textttt ".
Can someone please explain how the first and second results are generated and why they are different?

For clarity, let's replace spaces with digits:
import re
nw = "01textttt2345"
xx = re.sub(r'\d+(textttt)\d+', r'\1 ', nw)
print('[%s]' % xx)  # [textttt ]
xx = re.sub(r'\d(textttt)\d', r'\1 ', nw)
print('[%s]' % xx)  # [0textttt 345]
The first expression matches the whole string 01textttt2345 and replaces it with the value of the group (textttt) plus a space. The second one matches only 1textttt2 and replaces that with textttt plus a space, leaving the rest of the string (the leading 0 and the trailing 345) untouched.

\s matches a single whitespace character.
\s+ matches a sequence of one or more whitespace characters.
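To see the difference on whitespace directly, here is a minimal sketch. It assumes an input with three spaces on each side of the word (my own choice for illustration, since a single space on each side would give the same result for both patterns).

```python
import re

# Assumed input: three spaces on each side of the word.
s = "   textttt   "

# \s+ consumes every surrounding whitespace character.
greedy = re.sub(r'\s+(textttt)\s+', r'\1 ', s)

# \s consumes exactly one whitespace character on each side.
single = re.sub(r'\s(textttt)\s', r'\1 ', s)

print(repr(greedy))  # 'textttt '
print(repr(single))  # '  textttt   '
```

With \s, two leading and two trailing spaces stay outside the match, so they survive the substitution.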


How to match escaped group signs {&date:dd.\{mm\}.yyyy} but not {&date:dd.{mm}.yyyy} with vba and regex

I'm trying to create a pattern for finding placeholders within a string to be able to replace them with variables later. I'm stuck on a problem to find all these placeholders within a string according to my requirement.
I already found this post, but it only helped a little:
Regex match ; but not \;
Placeholders will look like this
{&var} --> Variable stored in a dictionary --> dict("var")
{$prop} --> Property of a class cls.prop read by CallByName and PropGet
{#const} --> Some constant values by name from a function
Generally I have this pattern and it works well
Dim RegEx As Object
Set RegEx = CreateObject("VBScript.RegExp")
RegEx.pattern = "\{([#\$&])([\w\.]+)\}"
For example I have this string:
"Value of foo is '{&var}' and bar is '{$prop}'"
I get 2 matches as expected
(&)(var)
($)(prop)
I also want to add a formatting part, as in .NET, to this expression.
String.Format("This is a date: {0:dd.mm.yyyy}", DateTime.Now());
// This is a date: 05.07.2019
String.Format("This is a date, too: {0:dd.(mm).yyyy}", DateTime.Now());
// This is a date, too: 05.(07).2019
I extended the RegEx to get that optional formatting string
Dim RegEx As Object
Set RegEx = CreateObject("VBScript.RegExp")
RegEx.pattern = "\{([#\$&])([\w\.]+):{0,1}([^\}]*)\}"
RegEx.Execute("Value of foo is '{&var:DD.MM.YYYY}' and bar is '{$prop}'")
I get 2 matches as expected
(&)(var)(DD.MM.YYYY)
($)(prop)()
At this point I noticed that I have to take care of escaped "{" and "}", because I may want to have some braces within the formatted result.
This does not work properly, because my pattern stops after "...{MM"
RegEx.Execute("Value of foo is '{&var:DD.{MM}.YYYY}' and bar is '{$prop}'")
It would be okay to add escape signs to the text before checking the regex:
RegEx.Execute("Value of foo is '{&var:DD.\{MM\}.YYYY}' and bar is '{$prop}'")
But how can I correctly add the negative lookbehind?
And second: how would this also work for variables that should not be resolved, even though they have the correct syntax, because the outer brace is escaped?
RegEx.Execute("This should not match '\{&var:DD.\{MM\}.YYYY\}' but this one '{&var:DD.\{MM\}.YYYY}'")
I hope my question is not confusing and someone can help me
Update 05.07.19 at 12:50
After the great help of @wiktor-stribiżew the solution is complete.
As requested, I provide some example code:
Sub testRegEx()
Debug.Print FillVariablesInText(Nothing, "Date\\\\{$var01:DD.\{MM\}.YYYY}\\\\ Var:\{$nomatch\}{$var02} Double: {#const}{$var01} rest of string")
End Sub
Function FillVariablesInText(ByRef dict As Dictionary, ByVal txt As String) As String
Const c_varPattern As String = "(?:(?:^|[^\\\n])(?:\\{2})*)\{([#&\$])([\w.]+)(?:\:([^}\\]*(?:\\.[^\}\\]*)*))?(?=\})"
Dim part As String
Dim snippets As New Collection
Dim allMatches, m
Dim i As Long, j As Long, x As Long, n As Long
' Create a RegEx object and execute pattern
Dim RegEx As Object
Set RegEx = CreateObject("VBScript.RegExp")
RegEx.pattern = c_varPattern
RegEx.MultiLine = True
RegEx.Global = True
Set allMatches = RegEx.Execute(txt)
' Start at position 1 of txt
j = 1
n = 0
For Each m In allMatches
n = n + 1
Debug.Print "(" & n & "):" & m.value
Debug.Print " [0] = " & m.SubMatches(0) ' Type [&$#]
Debug.Print " [1] = " & m.SubMatches(1) ' Name
Debug.Print " [2] = " & m.SubMatches(2) ' Format
part = "{" & m.SubMatches(0)
' Get offset for pre-match-string
x = 1 ' Index to position: at least +1
Do While Mid(m.value, x, 2) <> part
x = x + 1
Loop
' Position in txt
i = m.FirstIndex + x
' Anything to add to result?
If i <> j Then
snippets.Add Mid(txt, j, i - j)
End If
' Next start position (not index!), + 1 for the positive lookahead "}"
j = m.FirstIndex + m.Length + 2
' Here comes a function to get the actual value
' e.g.: snippets.Add dict(m.SubMatches(1))
' or : snippets.Add Format(dict(m.SubMatches(1)), m.SubMatches(2))
snippets.Add "<<" & m.SubMatches(0) & m.SubMatches(1) & ">>"
Next m
' Any text at the end?
If j < Len(txt) Then
snippets.Add Mid(txt, j)
End If
' Join snippets
For i = 1 To snippets.Count
FillVariablesInText = FillVariablesInText & snippets(i)
Next
End Function
The function testRegEx gives me this result and debug print:
(1):e\\\\{$var01:DD.\{MM\}.YYYY
[0] = $
[1] = var01
[2] = DD.\{MM\}.YYYY
(2):}{$var02
[0] = $
[1] = var02
[2] =
(3): {#const
[0] = #
[1] = const
[2] =
(4):}{$var01
[0] = $
[1] = var01
[2] =
Date\\\\<<$var01>>\\\\ Var:\{$nomatch\}<<$var02>> Double: <<#const>><<$var01>> rest of string
You may use
((?:^|[^\\])(?:\\{2})*)\{([#$&])([\w.]+)(?::([^}\\]*(?:\\.[^}\\]*)*))?}
To make sure the consecutive matches are found, too, turn the last } into a lookahead, and when extracting matches just append it to the result, or if you need the indices increment the match length by 1:
((?:^|[^\\])(?:\\{2})*)\{([#$&])([\w.]+)(?::([^}\\]*(?:\\.[^}\\]*)*))?(?=})
                                                                      ^^^^^
Details
((?:^|[^\\])(?:\\{2})*) - Group 1 (makes sure the { that comes next is not escaped): start of string or any char but \, followed by 0 or more double backslashes
\{ - a { char
([#$&]) - Group 2: any of the three chars
([\w.]+) - Group 3: 1 or more word or dot chars
(?::([^}\\]*(?:\\.[^}\\]*)*))? - an optional sequence of : and then Group 4:
[^}\\]* - 0 or more chars other than } and \
(?:\\.[^}\\]*)* - zero or more repetitions of a \-escaped char and then 0 or more chars other than } and \
} - a } char
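For readers who want to try the lookahead variant outside VBA, here is a rough sketch in Python. The helper name find_placeholders is made up for this illustration, and Python's re module stands in for VBScript.RegExp, so treat it as a way to experiment with the pattern rather than as the VBA solution itself.

```python
import re

# The pattern from this answer (closing brace as a lookahead), as a Python raw string.
PLACEHOLDER = re.compile(
    r'((?:^|[^\\])(?:\\{2})*)\{([#$&])([\w.]+)(?::([^}\\]*(?:\\.[^}\\]*)*))?(?=})'
)

def find_placeholders(text):
    """Hypothetical helper: return (kind, name, format) for each unescaped placeholder.

    format is None when the placeholder has no ':' part.
    """
    return [(m.group(2), m.group(3), m.group(4)) for m in PLACEHOLDER.finditer(text)]

# The test string from the question (backslashes are literal, as in VBA).
sample = r"Date\\\\{$var01:DD.\{MM\}.YYYY}\\\\ Var:\{$nomatch\}{$var02} Double: {#const}{$var01} rest of string"

for kind, name, fmt in find_placeholders(sample):
    print(kind, name, fmt)
```

The escaped \{$nomatch\} is skipped, while }{$var02 is still found because the } that closes the previous placeholder is never consumed by the lookahead.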
Welcome to the site! If you need to match only balanced escapes, you will need something more powerful. If not (I haven't tested this), you could try replacing [^\}]* with (?:\\\{|\\\}|[^\{\}])*. That is, match escaped brace sequences and non-brace characters as separate alternatives, repeated as a group. You may need to change this depending on how you want to handle backslashes in your formatting string.

Stars and string combination pattern in Python

I want a pattern like:
Input : Python is Interactive (any string separated by space)
Expected Output:
*************
*Python     *
*is         *
*Interactive*
*************
I tried using Python's "re" module, but I was not able to create the stars in the pattern:
inp = "Python is interactive"
import re
split = re.split(' ', inp)
length = []
for item in range(len(split)):
    length.append(len(split[item]))
Max = (max(length))
for i in range(len(split)):
    print(split[i])
You don't need the re module. Your approach is not that bad, but needs some rework:
input = "Python is interactive"
parts = input.split(" ")
maxlen = max(map(lambda part: len(part), parts))
# or this, if you want to go even more elegant:
maxlen = max(map(len, parts))
print('*' * (maxlen + 4))
for part in parts:
    spaces = maxlen - len(part)
    print("* " + part + (" " * spaces) + " *")
print('*' * (maxlen + 4))
For splitting you can use the str.split method. Then I calculate the maximum length (like you did, but a little more elegantly).
Then I print a line of stars whose length is that of the longest word plus 4, because each line starts with "* " and ends with " *", which adds 4 characters.
Then I print each word followed by as many spaces as are needed for padding.
Finally the last line of stars.
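If you want the frame exactly as laid out in the question (stars directly against the words, no inner padding), a variant sketch could use str.ljust for the padding. The function name star_box is my own and not something from the question.

```python
def star_box(text):
    """Hypothetical helper: frame each space-separated word in a box of stars."""
    words = text.split()               # split on whitespace
    width = max(len(w) for w in words)  # raises ValueError for empty input
    border = '*' * (width + 2)          # +2 for the star on each side
    lines = [border]
    lines += ['*' + w.ljust(width) + '*' for w in words]
    lines.append(border)
    return '\n'.join(lines)

print(star_box("Python is Interactive"))
```

Returning the string instead of printing inside the function keeps it easy to reuse or test.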

RegExp other patterns not working

I continue trying to perform string format matching using RegExp in VBScript & VB6. I am now trying to match a short, single-line string formatted as:
1. Seven characters:
   a. Six alphanumeric plus one "-" OR
   b. Five alphanumeric plus two "-"
2. Three numbers
3. Two letters
4. Literal "65"
5. A two-digit hex number.
Examples include 123456-789LM65F2, 4EF789-012XY65A5, A2345--789AB65D0 & 23456--890JK65D0.
The RegExp pattern ([A-Z0-9\-]{12})([65][A-F0-9]{2}) lumps (1) - (3) together and finds these OK.
However, if I try to:
c) Break (3) out w/ pattern ([A-Z0-9\-]{10})([A-Z]{2})([65][A-F0-9]{2}),
d) Break out both (2) & (3) w/ pattern ([A-Z0-9\-]{7})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2}), or
e) Tighten up (1) with alternation pattern ([A-Z0-9]{5}[-]{2}|[A-Z0-9]{6}[-]{1})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2})
it refuses to find any of them.
What am I doing wrong? Following is a VBScript that runs and checks these.
' VB Script
Main()
Function Main() ' RegEx_Format_sample.vbs
'Uses two patterns, TestPttn for full format accuracy check & SplitPttn
'to separate the two desired pieces
Dim reSet, EtchTemp, arrSplit, sTemp
Dim sBoule, sSlice, idx, TestPttn, SplitPttn, arrMatch
Dim arrPttn(3), arrItems(3), idxItem, idxPttn, Msgtemp
Set reSet = New RegExp
' reSet.IgnoreCase = True ' Not using
' reSet.Global = True ' Not using
' load test case formats to check & split
arrItems(0) = "0,6 nums + 1 '-',123456-789LM65F2"
arrItems(1) = "1,6 chars + 1 '-',4EF789-012XY65A5"
arrItems(2) = "2,5 chars + 2 '-',A2345--789AB65D0"
arrItems(3) = "3,5 nums + 2 '-',23456--890JK65D0"
SplitPttn = "([A-Z0-9]{5,6})[-]{1,2}([A-Z0-9]{9})" ' split pattern has never failed to work
' load the patterns to try
arrPttn(0) = "([A-Z0-9\-]{12})([65][A-F0-9]{2})"
arrPttn(1) = "([A-Z0-9\-]{10}[A-Z]{2})([65][A-F0-9]{2})"
arrPttn(2) = "([A-Z0-9\-]{7})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2})"
arrPttn(3) = "([A-Z0-9]{5}[-]{2}|[A-Z0-9]{6}[-]{1})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2})"
For idxPttn = 0 To 3 ' select Test pattern
TestPttn = arrPttn(idxPttn)
TestPttn = TestPttn & "[%]" ' append % "ender" char
SplitPttn = SplitPttn & "[%]" ' append % "ender" char
For idxItem = 0 To 3
reSet.Pattern = TestPttn ' set to Test pattern
sTemp = arrItems(idxItem )
arrSplit = Split(sTemp, ",") ' arrSplit is Split array
EtchTemp = arrSplit(2) & "%" ' append % "ender" char to Item sub (2) as the "phrase" under test
If reSet.Test(EtchTemp) = False Then
MsgBox("RegEx " & TestPttn & " false for " & EtchTemp & " as " & arrSplit(1) )
Else ' test OK; now switch to SplitPttn
reSet.Pattern = SplitPttn
Set arrMatch = reSet.Execute(EtchTemp) ' run Pttn as Exec this time
If arrMatch.Count > 0 then ' If test OK then Count s/b > 0
Msgtemp = ""
Msgtemp = "RegEx " & TestPttn & " TRUE for " & EtchTemp & " as " & arrSplit(1)
For idx = 0 To arrMatch.Item(0).Submatches.Count - 1
Msgtemp = Msgtemp & Chr(13) & Chr(10) & "Split segment " & idx & " as " & arrMatch.Item(0).submatches.Item(idx)
Next
MsgBox(Msgtemp)
End If ' Count OK
End If ' test OK
Next ' idxItem
Next ' idxPttn
End Function
Try this Regex:
(?:[A-Z0-9]{6}-|[A-Z0-9]{5}--)[0-9]{3}[A-Z]{2}65[0-9A-F]{2}
Explanation:
(?:[A-Z0-9]{6}-|[A-Z0-9]{5}--) - matches either 6 Alphanumeric characters followed by a - or 5 Alphanumeric characters followed by a --
[0-9]{3} - matches 3 Digits
[A-Z]{2} - matches 2 Letters
65 - matches 65 literally
[0-9A-F]{2} - matches 2 HEX symbols
You can get some idea from the following code:
VBScript Code:
Option Explicit
Dim objReg, strTest
strTest = "123456-789LM65F2" 'Change the value as per your requirements. You can also store a list of values in an array and run the code in loop
set objReg = new RegExp
objReg.Global = True
objReg.IgnoreCase = True
objReg.Pattern = "(?:[A-Z0-9]{6}-|[A-Z0-9]{5}--)[0-9]{3}[A-Z]{2}65[0-9A-F]{2}"
if objReg.test(strTest) then
msgbox strTest&" matches with the Pattern"
else
msgbox strTest&" does not match with the Pattern"
end if
set objReg = Nothing
Your patterns do not work because:
([A-Z0-9\-]{12})([65][A-F0-9]{2}) - matches 12 occurrences of either an AlphaNumeric character or - followed by either 6 or 5 followed by 2 HEX characters
([A-Z0-9\-]{10}[A-Z]{2})([65][A-F0-9]{2}) - matches 10 occurrences of either an AlphaNumeric character or - followed by 2 Letters followed by either 6 or 5 followed by 2 HEX characters
([A-Z0-9\-]{7})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2}) - matches 7 occurrences of either an AlphaNumeric character or - followed by 3 digits followed by 2 Letters followed by either 6 or 5 followed by 2 HEX characters
([A-Z0-9]{5}[-]{2}|[A-Z0-9]{6}[-]{1})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2}) - matches either 5 occurrences of an AlphaNumeric character followed by -- or 6 occurrences of an Alphanumeric followed by a -. This is then followed by 3 digits followed by 2 Letters followed by either 6 or 5 followed by 2 HEX characters
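The effect of [65] versus 65 can be checked quickly with the four sample strings from the question. This is only a sketch in Python rather than VBScript, and re.fullmatch plays the role that the "%" ender plays in the original script (it forces the pattern to reach the end of the string).

```python
import re

samples = ["123456-789LM65F2", "4EF789-012XY65A5",
           "A2345--789AB65D0", "23456--890JK65D0"]

# [65] is a character class: it matches ONE character, either 6 or 5.
with_class = re.compile(
    r'(?:[A-Z0-9]{6}-|[A-Z0-9]{5}--)[0-9]{3}[A-Z]{2}[65][0-9A-F]{2}')

# 65 is the two-character literal.
with_literal = re.compile(
    r'(?:[A-Z0-9]{6}-|[A-Z0-9]{5}--)[0-9]{3}[A-Z]{2}65[0-9A-F]{2}')

for s in samples:
    print(s, bool(with_class.fullmatch(s)), bool(with_literal.fullmatch(s)))
```

With the character class the pattern describes a string that is one character too short, so it cannot reach the end of any sample once the letters are pinned down by [A-Z]{2}.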
Try this pattern :
(([A-Z0-9]{5}--)|([A-Z0-9]{6}-))[0-9]{3}[A-Z]{2}65[0-9A-F]{2}
Or, if the last part doesn't like the [A-F]
(([A-Z0-9]{5}--)|([A-Z0-9]{6}-))[0-9]{3}[A-Z]{2}65[0-9ABCDEF]{2}
All, tanx again for your help!!
trincot, everything in each arrItems() between the commas, incl the "plus", is merely part of a shorthand description of each item's characteristics, such as "5 characters plus 2 dashes".
Gurman, your pttn breakdowns were helpful, but, if I read it right, the addition of the ? prefix is a "Match zero or one occurrences" and this must match exactly one occurrence. Also, my 1st pattern (matches 12) actually DID work for all my test cases.
jNevill, & JMichelB your suggestions are very close to what I ended up with.
I was "over-classing". After some tinkering, I was able to get the Test Pttn to successfully recognize these test cases by taking the [65] out of the [] in my original Alternation pattern. That is I went from ([65]) to (65) and Zammo! it worked.
Orig pattern:
([A-Z0-9]{5}[-]{2}|[A-Z0-9]{6}[-]{1})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2})
Wkg pattern:
([A-Z0-9]{5}[-]{2}|[A-Z0-9]{6}[-]{1})([0-9]{3})([A-Z]{2})(65)([A-F0-9]{2})
Oh, and I moved the
SplitPttn = SplitPttn & "[%]" ' append % "ender" char
stmt up out of the For...Next loop. That helped w/ the splitting.
T-Bone

Matlab: regexprep with variable

I have an array a: a list of identified words to be searched for in an array b and replaced by the empty string. newB is the expected result.
The value of a might vary according to an input file.
I am trying to use regexprep but it is not working well.
e.g.:
a = {'apple';'banana';'orange'}; % a might be also ‘watermelon’, ‘papaya’ etc
b = {'1 apple = 2 kiwi';'1 fig = 1 banana';'1 orange = 3 strawberry'};
newB = {' = 2 kiwi';'1 fig = ';' = 3 strawberry'};
From your example it seems like you want to remove a special word and a number, the appropriate regular expression for this is (for word = 'apple'): '\d+ apple'. Building the regular expression from all the words in a, using sprintf:
re = sprintf('\\d+ %s|',a{:}); %// adding | operator to select between expressions
re(end)=[]; %// discard the last '|'
The resulting regular expression is
re =
'\d+ apple|\d+ banana|\d+ orange'
Now the actual replacement:
newB = regexprep(b,re,'')
which results in
newB =
' = 2 kiwi'
'1 fig = '
' = 3 strawberry'
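For comparison, the same idea (build one alternation from a variable word list, then replace) might look like this in Python. It is only a sketch; re.escape is an addition of mine so that words containing regex metacharacters are matched literally.

```python
import re

a = ['apple', 'banana', 'orange']
b = ['1 apple = 2 kiwi', '1 fig = 1 banana', '1 orange = 3 strawberry']

# One '\d+ word' alternative per entry, joined with '|'.
pattern = '|'.join(r'\d+ ' + re.escape(word) for word in a)

new_b = [re.sub(pattern, '', line) for line in b]

print(pattern)  # \d+ apple|\d+ banana|\d+ orange
print(new_b)    # [' = 2 kiwi', '1 fig = ', ' = 3 strawberry']
```

Joining with '|' avoids having to discard a trailing separator afterwards.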

Why is this regexp slow when the input line is long and has many spaces?

VBScript's Trim function only trims spaces. Sometimes I want to trim TABs as well. For this I've been using this custom trimSpTab function that is based on a regular expression.
Today I ran into a performance problem. The input consisted of rather long lines (several 1000 chars).
As it turns out
- the function is slow, only if the string is long AND contains many spaces
- the right-hand part of the regular expression is responsible for the poor performance
- the run time seems quadratic to the line length (O(n^2))
So why is this line trimmed fast
" aaa xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx bbb " '10000 x's
and this one trimmed slowly
" aaa bbb " '10000 spaces
Both contain only 6 characters to be trimmed.
Can you propose a modification to my trimSpTab function?
Dim regex
Set regex = new regexp
' TEST 1 - executes in no time
' " aaa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX bbb "
t1 = Timer
character = "X"
trimTest character
MsgBox Timer-t1 & " sec",, "with '" & character & "' in the center of the string"
' TEST 2 - executes in 1 second on my machine
' " aaa bbb "
t1 = Timer
character = " "
trimTest character
MsgBox Timer-t1 & " sec",, "with '" & character & "' in the center of the string"
Sub trimTest (character)
sInput = " aaa " & String (10000, character) & " bbb "
trimmed = trimSpTab (sInput)
End Sub
Function trimSpTab (byval s)
'trims spaces & tabs
regex.Global = True
regex.Pattern = "^[ \t]+|[ \t]+$" 'trim left+right
trimSpTab = regex.Replace (s, "")
End Function
I have tried this (with regex.Global = false) but to no avail
regex.Pattern = "^[ \t]+" 'trim left
s = regex.Replace (s, "")
regex.Pattern = "[ \t]+$" 'trim right
trimSpTab = regex.Replace (s, "")
UPDATE
I've come up with this alternative in the meantime. It processes a 100 million character string in less than a second.
Function trimSpTab (byval s)
'trims spaces & tabs
regex.Pattern = "^[ \t]+"
s = strReverse (s)
s = regex.Replace (s, "")
s = strReverse (s)
s = regex.Replace (s, "")
trimSpTab = s
End Function
Solution
As mentioned in the question, your current solution is to reverse the string. However, this is not necessary if you can use .NET, since .NET regex supports the RightToLeft matching option (VBScript's own RegExp object has no such option). With it, the engine matches the same regex from right to left instead of the default left to right.
Below is sample code in C#, which I hope you can adapt to a VB.NET solution (I don't know VB well enough to write sample code):
input = new Regex("^[ \t]+").Replace(input, "", 1);
input = new Regex("[ \t]+$", RegexOptions.RightToLeft).Replace(input, "", 1);
Explanation
The long run time is due to the engine trying to match [ \t]+ indiscriminately in the middle of the string and failing each time the run turns out not to be a trailing blank sequence.
The observation that the complexity is quadratic is correct.
We know that the regex engine starts matching from index 0. If there is a match, then the next attempt starts at the end of the last match. Otherwise, the next attempt starts at the (current index + 1). (Well, to simplify things, I don't mention the case where a zero-length match is found).
The diagram below illustrates how far each attempt of the engine gets (some attempts match, some do not) when matching the regex ^[ \t]+|[ \t]+$. _ is used to denote a space (or tab character) for clarity.
_____ab_______________g________
^----
     ^
      ^
       ^--------------
        ^-------------
         ^------------
          ...
                     ^
                      ^
                       ^-------
When there is a long sequence of spaces & tabs in the middle of the string (which will not produce a match), the engine attempts matching at every index in that sequence. As a result, the engine ends up going through O(k^2) characters on a non-matching sequence of spaces & tabs of length k.
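The quadratic growth can be made concrete with a small counting model. This is a simplified sketch of the retry behaviour described above, not the actual VBScript engine: it only models the [ \t]+$ alternative, and the function name blanks_scanned is made up for the illustration.

```python
def blanks_scanned(s, blanks=" \t"):
    """Count blank characters consumed when [ \\t]+$ is retried at every start index."""
    total = 0
    for start in range(len(s)):
        end = start
        # Greedy [ \t]+ : run forward over blanks from this start index.
        while end < len(s) and s[end] in blanks:
            end += 1
        total += end - start
        if end == len(s) and end > start:
            break  # the run reached the end of the string: $ matches, stop
    return total

for k in (10, 100, 1000):
    middle_blanks = "a" + " " * k + "b"
    middle_letters = "a" + "x" * k + "b"
    print(k, blanks_scanned(middle_blanks), blanks_scanned(middle_letters))
```

For a run of k blanks in the middle, the model consumes k + (k-1) + ... + 1 = k(k+1)/2 characters, while a run of k letters costs nothing, which is consistent with the timings reported in the question.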
Your evidence proves that VBScript's RegExp implementation does not optimize for the $ anchor: It spends time (backtracking?) for each of the spaces in the middle of your test string. Without doubt, that's a fact good to know.
If this causes you real world problems, you'll have to find/write a better (R)Trim function. I came up with:
Function trimString(s, p)
Dim l : l = Len(s)
If 0 = l Then
trimString = s
Exit Function
End If
Dim ps, pe
For ps = 1 To l
If 0 = Instr(p, Mid(s, ps, 1)) Then
Exit For
End If
Next
For pe = l To ps Step -1
If 0 = Instr(p, Mid(s, pe, 1)) Then
Exit For
End If
Next
trimString = Mid(s, ps, pe - ps + 1)
End Function
It surely needs testing and benchmarks for long heads or tails of white space, but I hope it gets you started.