Regex - number of characters for sequence - regex

I have the following pattern:
<tag-2>B1</tag-2>
<tag-3>A12</tag-3>
<tag-4>M123</tag-4>
//etc
There is always one letter and digits.
I need to create a regex which uses number from the tag and applies it to the sequence between tags. I know that I can use a backreference but I don't know how to construct a regex. Here is incomplete regex:
"^<tag-([2-9])>[A-Z][0-9]/*how to apply here number from the tag ?*/</tag-\\1>$"
Edit
The following strings are not matched:
<tag-2>11</tag-2> //missing letter
<tag-2>BB</tag-2> // missing digit
<tag-3>B123</tag-3> //too many digits
<tag-3>AA1</tag-3> //should be only one letter and two digits
<tag-4>N12</tag-4> //too few digits

Regular expressions cannot contain elements that are functions of the values of back-references (other than the back-references themselves). That's because regular expressions are static from the time they are constructed.
One could, however, extract the desired string, or conclude that the sting contains no valid substring, in two steps. First attempt to match the string against /<tag-(\d+)>, where the contents of the capture group, after being converted to an integer, equals the length of the string that begins with a capital letter and is followed by digits. That information can then be used to construct a second regular expression that is used to verify the remainder of the match and extract the desired string.
I will use Ruby to illustrate how that might be done here. The operations--and certainly the two regular expressions--should be clear even to readers who are not familiar with Ruby.
Code
R = /<tag-(\d+)>/ # a constant
def doit(str)
m = str.match(R) # obtain a MatchData object; else nil
return nil if m.nil? # finished if no match
n = m[1].to_i-1 # required number of digits
r = /\A\p{Lu}\d{#{n}}(?=<\/tag-#{m[1]}>)/
# regular expression for second match
str[m.end(0).to_i..-1][r] # extract the desired string; else nil
end
Examples
arr = <<_.each_line.map(&:chomp)
<tag-2>B1</tag-2>
<tag-3>A12</tag-3>
<tag-4>M123</tag-4>
<tag-2>11</tag-2>
<tag-2>BB</tag-2>
<tag-3>B123</tag-3>
<tag-3>AA1</tag-3>
<tag-4>N12</tag-4>
_
#=> ["<tag-2>B1</tag-2>", "<tag-3>A12</tag-3>",
# "<tag-4>M123</tag-4>", "<tag-2>11</tag-2>",
# "<tag-2>BB</tag-2>", "<tag-3>B123</tag-3>",
# "<tag-3>AA1</tag-3>", "<tag-4>N12</tag-4>"]
arr.map do |line|
s = doit(line)
s = 'nil' if s.nil?
puts "#{line.ljust(22)}: #{s}"
end
<tag-2>B1</tag-2> : B1
<tag-3>A12</tag-3> : A12
<tag-4>M123</tag-4> : M123
<tag-2>11</tag-2> : nil
<tag-2>BB</tag-2> : nil
<tag-3>B123</tag-3> : nil
<tag-3>AA1</tag-3> : nil
<tag-4>N12</tag-4> : nil
Explanation
Note that (?=<\/tag-#{m[1]}>) (part of r in the body of the method) is a positive lookahead, meaning that "<\/tag-#{m[1]}>" (with #{m[1]} substituted out) must be matched, but is not part of the match that is returned.
The step-by-step calculations are as follows.
str = "<tag-2>B1</tag-2>"
m = str.match(R)
#=> #<MatchData "<tag-2>" 1:"2">
m[0]
#=> "<tag-2>" (match)
m[1]
#=> "2" (contents of capture group 1)
m.end(0)
#=> 7 (index of str where the match ends, plus 1)
m.nil?
#=> false (do not return)
n = m[1].to_i-1
#=> 1 (number of digits required)
r = /\A\p{Lu}\d{#{n}}(?=\<\/tag\-#{m[1]}\>)/
#=> /\A\p{Lu}\d{1}(?=\<\/tag\-2\>)/
s = str[m.end(0).to_i..-1]
#=> str[7..-1]
#=> "B1</tag-2>"
s[r]
#=> "B1"

It looks like you're trying to create a pattern that will interpret a number in order to determine how long a string should be. I don't know of any feature to automate this process in any regular expression engine, but it can be done in a more manual fashion by enumerating all cases which you wish to handle.
For example, tags 2 through 9 can be handled as such:
<tag-2>: ^<tag-2>[A-Z][0-9]</tag-2>$
<tag-3>: ^<tag-3>[A-Z][0-9]{2}</tag-3>$
<tag-4>: ^<tag-4>[A-Z][0-9]{3}</tag-4>$
<tag-5>: ^<tag-5>[A-Z][0-9]{4}</tag-5>$
<tag-6>: ^<tag-6>[A-Z][0-9]{5}</tag-6>$
<tag-7>: ^<tag-7>[A-Z][0-9]{6}</tag-7>$
<tag-8>: ^<tag-8>[A-Z][0-9]{7}</tag-8>$
<tag-9>: ^<tag-9>[A-Z][0-9]{8}</tag-9>$
By removing the grouping and back-references you eliminate some complications that can occur when trying to combine regular expression patterns and can produce the following:
^(<tag-2>[A-Z][0-9]</tag-2>|<tag-3>[A-Z][0-9]{2}</tag-3>|<tag-4>[A-Z][0-9]{3}</tag-4>|<tag-5>[A-Z][0-9]{4}</tag-5>|<tag-6>[A-Z][0-9]{5}</tag-6>|<tag-7>[A-Z][0-9]{6}</tag-7>|<tag-8>[A-Z][0-9]{7}</tag-8>|<tag-9>[A-Z][0-9]{8}</tag-9>)$

Related

Shorten Regular Expression (\n) [duplicate]

I'd like to match three-character sequences of letters (only letters 'a', 'b', 'c' are allowed) separated by comma (last group is not ended with comma).
Examples:
abc,bca,cbb
ccc,abc,aab,baa
bcb
I have written following regular expression:
re.match('([abc][abc][abc],)+', "abc,defx,df")
However it doesn't work correctly, because for above example:
>>> print bool(re.match('([abc][abc][abc],)+', "abc,defx,df")) # defx in second group
True
>>> print bool(re.match('([abc][abc][abc],)+', "axc,defx,df")) # 'x' in first group
False
It seems only to check first group of three letters but it ignores the rest. How to write this regular expression correctly?
Try following regex:
^[abc]{3}(,[abc]{3})*$
^...$ from the start till the end of the string
[...] one of the given character
...{3} three time of the phrase before
(...)* 0 till n times of the characters in the brackets
What you're asking it to find with your regex is "at least one triple of letters a, b, c" - that's what "+" gives you. Whatever follows after that doesn't really matter to the regex. You might want to include "$", which means "end of the line", to be sure that the line must all consist of allowed triples. However in the current form your regex would also demand that the last triple ends in a comma, so you should explicitly code that it's not so.
Try this:
re.match('([abc][abc][abc],)*([abc][abc][abc])$'
This finds any number of allowed triples followed by a comma (maybe zero), then a triple without a comma, then the end of the line.
Edit: including the "^" (start of string) symbol is not necessary, because the match method already checks for a match only at the beginning of the string.
The obligatory "you don't need a regex" solution:
all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))
You need to iterate over sequence of found values.
data_string = "abc,bca,df"
imatch = re.finditer(r'(?P<value>[abc]{3})(,|$)', data_string)
for match in imatch:
print match.group('value')
So the regex to check if the string matches pattern will be
data_string = "abc,bca,df"
match = re.match(r'^([abc]{3}(,|$))+', data_string)
if match:
print "data string is correct"
Your result is not surprising since the regular expression
([abc][abc][abc],)+
tries to match a string containing three characters of [abc] followed by a comma one ore more times anywhere in the string. So the most important part is to make sure that there is nothing more in the string - as scessor suggests with adding ^ (start of string) and $ (end of string) to the regular expression.
An alternative without using regex (albeit a brute force way):
>>> def matcher(x):
total = ["".join(p) for p in itertools.product(('a','b','c'),repeat=3)]
for i in x.split(','):
if i not in total:
return False
return True
>>> matcher("abc,bca,aaa")
True
>>> matcher("abc,bca,xyz")
False
>>> matcher("abc,aaa,bb")
False
If your aim is to validate a string as being composed of triplet of letters a,b,and c:
for ss in ("abc,bbc,abb,baa,bbb",
"acc",
"abc,bbc,abb,bXa,bbb",
"abc,bbc,ab,baa,bbb"):
print ss,' ',bool(re.match('([abc]{3},?)+\Z',ss))
result
abc,bbc,abb,baa,bbb True
acc True
abc,bbc,abb,bXa,bbb False
abc,bbc,ab,baa,bbb False
\Z means: the end of the string. Its presence obliges the match to be until the very end of the string
By the way, I like the form of Sonya too, in a way it is clearer:
bool(re.match('([abc]{3},)*[abc]{3}\Z',ss))
To just repeat a sequence of patterns, you need to use a non-capturing group, a (?:...) like contruct, and apply a quantifier right after the closing parenthesis. The question mark and the colon after the opening parenthesis are the syntax that creates a non-capturing group (SO post).
For example:
(?:abc)+ matches strings like abc, abcabc, abcabcabc, etc.
(?:\d+\.){3} matches strings like 1.12.2., 000.00000.0., etc.
Here, you can use
^[abc]{3}(?:,[abc]{3})*$
^^
Note that using a capturing group is fraught with unwelcome effects in a lot of Python regex methods. See a classical issue described at re.findall behaves weird post, for example, where re.findall and all other regex methods using this function behind the scenes only return captured substrings if there is a capturing group in the pattern.
In Pandas, it is also important to use non-capturing groups when you just need to group a pattern sequence: Series.str.contains will complain that this pattern has match groups. To actually get the groups, use str.extract. and
the Series.str.extract, Series.str.extractall and Series.str.findall will behave as re.findall.

RegEx repeating subcapture returns only last match

I have a regexp that tries to find multiple sub captures within matches:
https://regex101.com/r/nk1Q5J/2/
=\s?.*(?:((?i)Fields\("(\w+)"\)\.Value\)*?))
I've had simpler versions with equivalent results but this is the last iteration.
the Trick here is that the first group looks for a sequence that begins with '=' (to identify database field reads in VB.Net)
The single sub capture cases work:
Match 1. [comparison + single parameter read call]
= False And IsNull(Fields("lmpRoundedEndTime").Value)
=> G2: lmpRoundedEndTime
Match 3. [read into string] oRs.Open "select lmpEmployeeID,lmpShiftID,lmpTimecardDate,lmpProjectID from Timecards where lmpTimecardID = " + App.Convert.NumberToSql(Fields("lmlTimecardID").Value),Connection,adOpenStatic,adLockReadOnly,adCmdText
=> G2: lmlTimecardID
Match 4. [assignment] Fields("lmlEmployeeID").Value = oRs.Fields("lmpEmployeeID").Value
Where I am failing is a match with multiple sub-captures. My regexp returns the last (intended) sub capture :
Match 2. [read multiple input parameters] Fields("lmpPayrollHours").Value = App.Ax("TimecardFunctions").CalculateHours(Fields("lmpShiftID").Value,Fields("lmpRoundedStartTime").Value,Fields("lmpRoundedEndTime").Value)
=> G2: lmpRoundedEndTime
'''''''''''' ^ must capture: lmpShiftID , lmpRoundedStartTime , lmpRoundedEndTime
I've read up on lazy quantifiers etc, but can't wrap my head around where this goes wrong.
References:
https://www.regular-expressions.info/refrepeat.html
http://www.rexegg.com/regex-quantifiers.html
Related:
Regular expression: repeating groups only getting last group
What's the difference between "groups" and "captures" in .NET regular expressions?
BTW, I could quantify the sub capture as {1,5} safely for efficiency, but that's not the focus.
EDIT:
By using negative lookahead to exclude the left side of comparisons, this got me much closer (match 2 above now works):
(?:Fields\("(\w+)"\)\.Value)(?!\)?\s{0,2}[=|<])
but in the following block of code, only the first two are captured:
If oRs.EOF = False Then
If CInt(Fields("lmlTimecardType").Value) = 1 Then
If Trim(oRs.Fields("lmeDirectExpenseID").Value) <> "" Then
Fields("lmlExpenseID").Value = oRs.Fields("lmeDirectExpenseID").Value
End If
Else
If Trim(oRs.Fields("lmeIndirectExpenseID").Value) <> "" Then
Fields("lmlExpenseID").Value = oRs.Fields("lmeIndirectExpenseID").Value
End If
End If
If CInt(Fields("lmlTimecardType").Value) = 2 Then
If Trim(oRs.FIelds("lmeDefaultWorkCenterID").Value) <> "" Then
Fields("lmlWorkCenterID").Value = oRs.FIelds("lmeDefaultWorkCenterID").Value
End If
End If
End If
Capture1:
Fields("lmlExpenseID").Value = oRs.Fields("lmeDirectExpenseID").Value
Capture2:
Fields("lmlExpenseID").Value = oRs.Fields("lmeIndirectExpenseID").Value
Capture3 (failed):
Fields("lmlWorkCenterID").Value = oRs.FIelds("lmeDefaultWorkCenterID").Value
Actually, (?:Fields\("(\w+)"\)\.Value)(?!\)?\s{0,2}[=|<]) does work in my Excel sheet, just not in the regex101 test. Probably a slightly different standard used there.
https://regex101.com/r/p6zFqy/1/

Regex - capture multiple groups and combine them multiple times in one string

I need to combine some text using regex, but I'm having a bit of trouble when trying to capture and substitute my string. For example - I need to capture digits from the start, and add them in a substitution to every section closed between ||
I have:
||10||a||ab||abc||
I want:
||10||a10||ab10||abc10||
So I need '10' in capture group 1 and 'a|ab|abc' in capture group 2
I've tried something like this, but it doesn't work for me (captures only one [a-z] group)
(?=.*\|\|(\d+)\|\|)(?=.*\b([a-z]+\b))
I would achieve this without a complex regular expression. For example, you could do this:
input = "||10||a||ab||abc||"
parts = input.scan(/\w+/) # => ["10", "a", "ab", "abc"]
parts[1..-1].each { |part| part << parts[0] } # => ["a10", "ab10", "abc10"]
"||#{parts.join('||')}||"
str = "||10||a||ab||abc||"
first = nil
str.gsub(/(?<=\|\|)[^\|]+/) { |s| first.nil? ? (first = s) : s + first }
#=> "||10||a10||ab10||abc10||"
The regular expression reads, "match one or more characters in a pipe immediately following two pipes" ((?<=\|\|) being a positive lookbehind).

How to replace part of string using regex pattern matching in scala?

I have a String which contains column names and datatypes as below:
val cdt = "header:integer|releaseNumber:numeric|amountCredit:numeric|lastUpdatedBy:numeric(15,10)|orderNumber:numeric(20,0)"
My requirement is to convert the postgres datatypes which are present as numeric, numeric(15,10) into spark-sql compatible datatypes.
In this case,
numeric -> decimal(38,30)
numeric(15,10) -> decimal(15,10)
numeric(20,0) -> bigint (This is an integeral datatype as there its precision is zero.)
In order to access the datatype in the string: cdt, I split it and created a Seq from it.
val dt = cdt.split("\\|").toSeq
Now I have a Seq of elements in which each element is a String in the below format:
Seq("header:integer", "releaseNumber:numeric","amountCredit:numeric","lastUpdatedBy:numeric(15,10)","orderNumber:numeric(20,0)")
I have the pattern matching regex: """numeric\(\d+,(\d+)\)""".r, for numeric(precision, scale) which only works if there is a
scale of two digits, ex: numeric(20,23).
I am very new to REGEX and Scala & I don't understand how to create regex pattterns for the remaining two cases & apply it on a string to match a condition. I tried it in the below way but it gives me a compilation error: "Cannot resolve symbol findFirstMatchIn"
dt.map(e => e.split("\\:")).map(e => changeDataType(e(0), e(1)))
def changeDataType(colName: String, cd:String): String = {
val finalColumns = new String
val pattern1 = """numeric\(\d+,(\d+)\)""".r
cd match {
case pattern1.findFirstMatchIn(dt) =>
}
}
I am trying to get the final output into a String as below:
header:integer|releaseNumber:decimal(38,30)|amountCredit:decimal(38,30)|lastUpdatedBy:decimal(15,10)|orderNumber:bigint
How to multiple regex patterns for different cases to check/apply pattern matching on datatype of each value in the seq and change it to my suitable datatype as mentioned above.
Could anyone let me know how can I achieve it ?
It can be done with a single regex pattern, but some testing of the match results is required.
val numericRE = raw"([^:]+):numeric(?:\((\d+),(\d+)\))?".r
cdt.split("\\|")
.map{
case numericRE(col,a,b) =>
if (Option(b).isEmpty) s"$col:decimal(38,30)"
else if (b == "0") s"$col:bigint"
else s"$col:decimal($a,$b)"
case x => x //pass-through
}.mkString("|")
//res0: String = header:integer|releaseNumber:decimal(38,30)|amountCredit:decimal(38,30)|lastUpdatedBy:decimal(15,10)|orderNumber:bigint
Of course it can be done with three different regex patterns, but I think this is pretty clear.
explanation
raw - don't need so many escape characters - \
([^:]+) - capture everything up to the 1st colon
:numeric - followed by the string ":numeric"
(?: - start a non-capture group
\((\d+),(\d+)\) - capture the 2 digit strings, separated by a comma, inside parentheses
)? - the non-capture group is optional
numericRE(col,a,b) - col is the 1st capture group, a and b are the digit captures, but they are inside the optional non-capture group so they might be null

Python RegEx query missing overlapping substrings

Python3.3, OS X 7.5
I am attempting to locate all instances of a 4-character substring defined as follows:
First character = 'N'
Second character = Anything but 'P'
Third character = 'S' or 'T'
Fourth character = Anything but 'P'
My query looks like this:
re.findall(r"\N[A-OQ-Z][ST][A-OQ-Z]", text)
This is working except in one particular case where two substrings overlap. That case involves the following 5character substring:
'...NNTSY...'
The query catches the first 4-character substring ('NNTS'), but not the second 4-character substring ('NTSY').
This is my first attempt at regular expressions, and obviously I'm missing something.
You can do this if the re engine does not consume characters as it matches them, which is possible with lookahead assertions:
import re
text = '...NNTSY...'
for m in re.findall(r'(?=(N[A-OQ-Z][ST][A-OQ-Z]))', text):
print(m)
Output:
NNTS
NTSY
Having everything within the assertion works but also feels weird. Another way is taking the N out of the assertion:
for m in re.findall(r'(N(?=([A-OQ-Z][ST][A-OQ-Z])))', text):
print(''.join(m))
From the Python 3 documentation (emphasis added):
$ python3 -c 'import re; help(re.findall)'
Help on function findall in module re:
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more capturing groups are present in the pattern, return
a list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
If you want overlapping instances, use regex.search() in a loop. You have to compile the regular expression because the API for non-compiled regular expressions doesn't take a parameter to specify the starting position.
def findall_overlapping(pattern, string, flags=0):
"""Find all matches, even ones that overlap."""
regex = re.compile(pattern, flags)
pos = 0
while True:
match = regex.search(string, pos)
if not match:
break
yield match
pos = match.start() + 1
(N[^P](?:S|T)[^P])
Edit live on Debuggex