Regular Expression: is it possible to get numbers in optional parts by one regex - regex

I have a string that looks like 1A2B3C, 2B3C, 1A2B or 1A3C.
The string is composed of several optional parts, each a number followed by one of A, B or C.
Is it possible to get the number before every character with one regex?
For example:
1A2B3C => (1, 2, 3)
1A3C => (1, 0, 3) There is no 'B', so gives 0 instead.
=> Or just (1, 3) but should show that the 3 is in front of 'C'.

Assuming Python because of your tuple notation, and because that's what I feel like using.
If the only allowed letters are A, B and C, you can do it with an extra processing step:
import re

# Every part is optional, so inputs such as 2B3C match as well.
pattern = re.compile(r'(?:(\d+)A)?(?:(\d+)B)?(?:(\d+)C)?')
match = pattern.fullmatch(some_string)
if match:
    result = tuple(int(g) for g in match.groups('0'))
else:
    raise ValueError('Bad input string')
Each option is surrounded by a non-capturing group (?:...) so the whole thing gets treated as a unit, and the trailing ? makes that unit optional. Inside the unit, there is a capturing group (\d+) to capture the number, and an uncaptured character.
The method Match.groups returns a tuple of all the groups in the regex; the argument '0' is the default used for groups that did not participate in the match. The generator expression then converts each one to an int. You could also use tuple(map(int, match.groups('0'))).
You can also use a dictionary to hold the numbers, keyed by character:
# Named groups; every part is optional, as in the tuple version.
pattern = re.compile(r'(?:(?P<A>\d+)A)?(?:(?P<B>\d+)B)?(?:(?P<C>\d+)C)?')
match = pattern.fullmatch(some_string)
if match:
    result = {k: int(v) for k, v in match.groupdict('0').items()}
else:
    raise ValueError('Bad input string')
Match.groupdict is just like groups except that it returns a dictionary of the named groups: capture groups marked (?P<NAME>...).
Finally, if you don't mind having the dictionary, you can adapt this approach to parse any number of groups with arbitrary characters:
pattern = re.compile(r'(\d+)([A-Z])')
result = {}
while some_string:
    match = pattern.match(some_string)
    if not match:
        raise ValueError('Bad input string')
    result[match.group(2)] = int(match.group(1))
    some_string = some_string[match.end():]
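As a minimal sketch, the tuple approach above can be wrapped in a function and run against the examples from the question. The name parse_abc is hypothetical, and the sketch assumes that every part (including the A part) is optional and that an empty string should be rejected; neither rule is stated explicitly in the question.

```python
import re

# Assumption: all three parts are optional, so 2B3C is valid input.
PATTERN = re.compile(r'(?:(\d+)A)?(?:(\d+)B)?(?:(\d+)C)?')

def parse_abc(text):
    """Return the numbers before A, B and C, using 0 for a missing part."""
    match = PATTERN.fullmatch(text)
    # An empty string also full-matches the pattern, so reject it explicitly.
    if not match or not text:
        raise ValueError('Bad input string')
    return tuple(int(g) for g in match.groups('0'))

print(parse_abc('1A2B3C'))  # (1, 2, 3)
print(parse_abc('1A3C'))    # (1, 0, 3)
print(parse_abc('2B3C'))    # (0, 2, 3)
```

Because the groups are positional, a 0 in the result always means that the corresponding letter was absent or carried the number 0; the dictionary variant has the same ambiguity.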

Related

Regex split string by two consecutive pipe ||

I want to split below string by two pipe(|| ) regex .
Input String
value1=data1||value2=da|ta2||value3=test&user01|
Expected Output
value1=data1
value2=da|ta2
value3=test&user01|
I tried ([^||]+), but it also splits on a single pipe |.
value2 contains a single pipe, which should not be treated as a separator.
I am using a Lua script like:
for pair in string.gmatch(params, "([^||]+)") do
    print(pair)
end
You can explicitly find each ||.
$ cat foo.lua
s = 'value1=data1||value2=da|ta2||value3=test&user01|'
offset = 1
for idx in string.gmatch(s, '()||') do
    print(string.sub(s, offset, idx - 1) )
    offset = idx + 2
end
-- Deal with the part after the right-most `||`.
-- Must +1 or it'll fail to handle s like "a=b||".
if offset <= #s + 1 then
    print(string.sub(s, offset) )
end
$ lua foo.lua
value1=data1
value2=da|ta2
value3=test&user01|
Regarding ()|| see Lua's doc about Patterns (Lua does not have regex support) —
Captures:
A pattern can contain sub-patterns enclosed in parentheses; they describe captures. When a match succeeds, the substrings of the subject string that match captures are stored (captured) for future use. Captures are numbered according to their left parentheses. For instance, in the pattern "(a*(.)%w(%s*))", the part of the string matching "a*(.)%w(%s*)" is stored as the first capture, and therefore has number 1; the character matching "." is captured with number 2, and the part matching "%s*" has number 3.
As a special case, the capture () captures the current string position (a number). For instance, if we apply the pattern "()aa()" on the string "flaaap", there will be two captures: 3 and 5.
The easiest way is to replace the two-character sequence || with a single character that does not occur in the data (e.g. ;), and only then use that character as the separator:
local params = "value1=data1||value2=da|ta2||value3=test&user01|"
for pair in string.gmatch(params:gsub('||',';'), "([^;]+)") do
    print(pair)
end
If every printable character can occur in the data, a non-printable character can be used instead, written by its code: string.char("10") == "\10" == "\n"
Even the character with code 1 works: "\1"
string.gmatch( params:gsub('||','\1'), "([^\1]+)" )
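For comparison only, here is a minimal Python sketch of the same split on the literal two-character separator. It is not Lua and not part of either answer; the function name split_pairs is hypothetical. Like the gmatch pattern with +, it drops empty pieces.

```python
def split_pairs(params, sep="||"):
    """Split on the literal separator and drop empty pieces."""
    return [piece for piece in params.split(sep) if piece]

params = "value1=data1||value2=da|ta2||value3=test&user01|"
for pair in split_pairs(params):
    print(pair)
# value1=data1
# value2=da|ta2
# value3=test&user01|
```

A single | never matches the separator here because str.split compares the whole string "||", whereas the character class [^||] in the question only ever looks at one character at a time.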

Regex - capture multiple groups and combine them multiple times in one string

I need to combine some text using regex, but I'm having a bit of trouble when trying to capture and substitute my string. For example - I need to capture digits from the start, and add them in a substitution to every section closed between ||
I have:
||10||a||ab||abc||
I want:
||10||a10||ab10||abc10||
So I need '10' in capture group 1 and 'a|ab|abc' in capture group 2
I've tried something like this, but it doesn't work for me (captures only one [a-z] group)
(?=.*\|\|(\d+)\|\|)(?=.*\b([a-z]+\b))
I would achieve this without a complex regular expression. For example, you could do this:
input = "||10||a||ab||abc||"
parts = input.scan(/\w+/) # => ["10", "a", "ab", "abc"]
parts[1..-1].each { |part| part << parts[0] } # => ["a10", "ab10", "abc10"]
"||#{parts.join('||')}||"
str = "||10||a||ab||abc||"
first = nil
str.gsub(/(?<=\|\|)[^\|]+/) { |s| first.nil? ? (first = s) : s + first }
#=> "||10||a10||ab10||abc10||"
The regular expression reads, "match one or more characters other than a pipe, immediately following two pipes" ((?<=\|\|) being a positive lookbehind).
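The same lookbehind idea can be sketched in Python with re.sub and a replacement callback. This is an illustrative translation rather than the answerer's code; the function name append_first is hypothetical.

```python
import re

def append_first(text):
    """Append the first ||-delimited field to every later field."""
    first = None

    def repl(match):
        nonlocal first
        if first is None:
            # Remember the first field and leave it unchanged.
            first = match.group(0)
            return first
        return match.group(0) + first

    # One or more non-pipe characters immediately preceded by two pipes.
    return re.sub(r'(?<=\|\|)[^|]+', repl, text)

print(append_first('||10||a||ab||abc||'))  # ||10||a10||ab10||abc10||
```

The callback is needed because the replacement depends on state gathered during the scan (the first field), which a plain replacement string cannot express.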

Regex - number of characters for sequence

I have the following pattern:
<tag-2>B1</tag-2>
<tag-3>A12</tag-3>
<tag-4>M123</tag-4>
//etc
There is always one letter and digits.
I need to create a regex which uses number from the tag and applies it to the sequence between tags. I know that I can use a backreference but I don't know how to construct a regex. Here is incomplete regex:
"^<tag-([2-9])>[A-Z][0-9]/*how to apply here number from the tag ?*/</tag-\\1>$"
Edit
The following strings are not matched:
<tag-2>11</tag-2> //missing letter
<tag-2>BB</tag-2> // missing digit
<tag-3>B123</tag-3> //too many digits
<tag-3>AA1</tag-3> //should be only one letter and two digits
<tag-4>N12</tag-4> //too few digits
Regular expressions cannot contain elements that are functions of the values of back-references (other than the back-references themselves). That's because regular expressions are static from the time they are constructed.
One could, however, extract the desired string, or conclude that the string contains no valid substring, in two steps. First attempt to match the string against /<tag-(\d+)>/, where the contents of the capture group, after being converted to an integer, equal the length of the substring that begins with a capital letter and is followed by digits. That information can then be used to construct a second regular expression that verifies the remainder of the string and extracts the desired substring.
I will use Ruby to illustrate how that might be done here. The operations--and certainly the two regular expressions--should be clear even to readers who are not familiar with Ruby.
Code
R = /<tag-(\d+)>/ # a constant
def doit(str)
  m = str.match(R) # obtain a MatchData object; else nil
  return nil if m.nil? # finished if no match
  n = m[1].to_i-1 # required number of digits
  r = /\A\p{Lu}\d{#{n}}(?=<\/tag-#{m[1]}>)/
  # regular expression for second match
  str[m.end(0).to_i..-1][r] # extract the desired string; else nil
end
Examples
arr = <<_.each_line.map(&:chomp)
<tag-2>B1</tag-2>
<tag-3>A12</tag-3>
<tag-4>M123</tag-4>
<tag-2>11</tag-2>
<tag-2>BB</tag-2>
<tag-3>B123</tag-3>
<tag-3>AA1</tag-3>
<tag-4>N12</tag-4>
_
#=> ["<tag-2>B1</tag-2>", "<tag-3>A12</tag-3>",
# "<tag-4>M123</tag-4>", "<tag-2>11</tag-2>",
# "<tag-2>BB</tag-2>", "<tag-3>B123</tag-3>",
# "<tag-3>AA1</tag-3>", "<tag-4>N12</tag-4>"]
arr.map do |line|
  s = doit(line)
  s = 'nil' if s.nil?
  puts "#{line.ljust(22)}: #{s}"
end
<tag-2>B1</tag-2>     : B1
<tag-3>A12</tag-3>    : A12
<tag-4>M123</tag-4>   : M123
<tag-2>11</tag-2>     : nil
<tag-2>BB</tag-2>     : nil
<tag-3>B123</tag-3>   : nil
<tag-3>AA1</tag-3>    : nil
<tag-4>N12</tag-4>    : nil
Explanation
Note that (?=<\/tag-#{m[1]}>) (part of r in the body of the method) is a positive lookahead, meaning that "<\/tag-#{m[1]}>" (with #{m[1]} substituted out) must be matched, but is not part of the match that is returned.
The step-by-step calculations are as follows.
str = "<tag-2>B1</tag-2>"
m = str.match(R)
#=> #<MatchData "<tag-2>" 1:"2">
m[0]
#=> "<tag-2>" (match)
m[1]
#=> "2" (contents of capture group 1)
m.end(0)
#=> 7 (index of str where the match ends, plus 1)
m.nil?
#=> false (do not return)
n = m[1].to_i-1
#=> 1 (number of digits required)
r = /\A\p{Lu}\d{#{n}}(?=\<\/tag\-#{m[1]}\>)/
#=> /\A\p{Lu}\d{1}(?=\<\/tag\-2\>)/
s = str[m.end(0).to_i..-1]
#=> str[7..-1]
#=> "B1</tag-2>"
s[r]
#=> "B1"
It looks like you're trying to create a pattern that will interpret a number in order to determine how long a string should be. I don't know of any feature to automate this process in any regular expression engine, but it can be done in a more manual fashion by enumerating all cases which you wish to handle.
For example, tags 2 through 9 can be handled as such:
<tag-2>: ^<tag-2>[A-Z][0-9]</tag-2>$
<tag-3>: ^<tag-3>[A-Z][0-9]{2}</tag-3>$
<tag-4>: ^<tag-4>[A-Z][0-9]{3}</tag-4>$
<tag-5>: ^<tag-5>[A-Z][0-9]{4}</tag-5>$
<tag-6>: ^<tag-6>[A-Z][0-9]{5}</tag-6>$
<tag-7>: ^<tag-7>[A-Z][0-9]{6}</tag-7>$
<tag-8>: ^<tag-8>[A-Z][0-9]{7}</tag-8>$
<tag-9>: ^<tag-9>[A-Z][0-9]{8}</tag-9>$
By removing the grouping and back-references you eliminate some complications that can occur when trying to combine regular expression patterns and can produce the following:
^(<tag-2>[A-Z][0-9]</tag-2>|<tag-3>[A-Z][0-9]{2}</tag-3>|<tag-4>[A-Z][0-9]{3}</tag-4>|<tag-5>[A-Z][0-9]{4}</tag-5>|<tag-6>[A-Z][0-9]{5}</tag-6>|<tag-7>[A-Z][0-9]{6}</tag-7>|<tag-8>[A-Z][0-9]{7}</tag-8>|<tag-9>[A-Z][0-9]{8}</tag-9>)$
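Writing the alternation out by hand is error-prone, so it can also be generated. The following Python sketch builds an equivalent pattern for tags 2 through 9; the function name build_pattern and its parameters are hypothetical, and a non-capturing group is used for the alternation.

```python
import re

def build_pattern(low=2, high=9):
    """Build one alternation with a fixed digit count per tag number."""
    alternatives = [
        r'<tag-%d>[A-Z][0-9]{%d}</tag-%d>' % (n, n - 1, n)
        for n in range(low, high + 1)
    ]
    return re.compile('^(?:' + '|'.join(alternatives) + ')$')

pattern = build_pattern()
print(bool(pattern.match('<tag-3>A12</tag-3>')))   # True
print(bool(pattern.match('<tag-3>B123</tag-3>')))  # False
```

The generated pattern is still static once compiled, so it stays within what an ordinary regex engine supports.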

How to replace part of string using regex pattern matching in scala?

I have a String which contains column names and datatypes as below:
val cdt = "header:integer|releaseNumber:numeric|amountCredit:numeric|lastUpdatedBy:numeric(15,10)|orderNumber:numeric(20,0)"
My requirement is to convert the postgres datatypes which are present as numeric, numeric(15,10) into spark-sql compatible datatypes.
In this case,
numeric -> decimal(38,30)
numeric(15,10) -> decimal(15,10)
numeric(20,0) -> bigint (This is an integral datatype, as its scale is zero.)
In order to access the datatype in the string: cdt, I split it and created a Seq from it.
val dt = cdt.split("\\|").toSeq
Now I have a Seq of elements in which each element is a String in the below format:
Seq("header:integer", "releaseNumber:numeric","amountCredit:numeric","lastUpdatedBy:numeric(15,10)","orderNumber:numeric(20,0)")
I have the pattern matching regex """numeric\(\d+,(\d+)\)""".r for numeric(precision, scale), which only works if there is a scale of two digits, ex: numeric(20,23).
I am very new to regex and Scala, and I don't understand how to create regex patterns for the remaining two cases and apply them to a string to match a condition. I tried it the following way, but it gives me a compilation error: "Cannot resolve symbol findFirstMatchIn"
dt.map(e => e.split("\\:")).map(e => changeDataType(e(0), e(1)))
def changeDataType(colName: String, cd:String): String = {
  val finalColumns = new String
  val pattern1 = """numeric\(\d+,(\d+)\)""".r
  cd match {
    case pattern1.findFirstMatchIn(dt) =>
  }
}
I am trying to get the final output into a String as below:
header:integer|releaseNumber:decimal(38,30)|amountCredit:decimal(38,30)|lastUpdatedBy:decimal(15,10)|orderNumber:bigint
How can I use multiple regex patterns for the different cases, to check the datatype of each value in the Seq and change it to the suitable datatype mentioned above?
Could anyone let me know how I can achieve this?
It can be done with a single regex pattern, but some testing of the match results is required.
val numericRE = raw"([^:]+):numeric(?:\((\d+),(\d+)\))?".r
cdt.split("\\|")
  .map{
    case numericRE(col,a,b) =>
      if (Option(b).isEmpty) s"$col:decimal(38,30)"
      else if (b == "0") s"$col:bigint"
      else s"$col:decimal($a,$b)"
    case x => x //pass-through
  }.mkString("|")
//res0: String = header:integer|releaseNumber:decimal(38,30)|amountCredit:decimal(38,30)|lastUpdatedBy:decimal(15,10)|orderNumber:bigint
Of course it can be done with three different regex patterns, but I think this is pretty clear.
explanation
raw - don't need so many escape characters - \
([^:]+) - capture everything up to the 1st colon
:numeric - followed by the string ":numeric"
(?: - start a non-capture group
\((\d+),(\d+)\) - capture the 2 digit strings, separated by a comma, inside parentheses
)? - the non-capture group is optional
numericRE(col,a,b) - col is the 1st capture group, a and b are the digit captures, but they are inside the optional non-capture group so they might be null
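The optional-group technique is not specific to Scala. As an illustrative sketch only, the same logic in Python looks like this; the function name convert is hypothetical, and fullmatch is used to mirror the way a Scala regex extractor must match the whole string.

```python
import re

# Same pattern as above: the parenthesised precision and scale are optional.
NUMERIC = re.compile(r'([^:]+):numeric(?:\((\d+),(\d+)\))?')

def convert(cdt):
    """Rewrite postgres numeric types into the target types, field by field."""
    out = []
    for field in cdt.split('|'):
        m = NUMERIC.fullmatch(field)
        if m is None:
            out.append(field)                        # pass-through
            continue
        col, precision, scale = m.groups()
        if scale is None:                            # bare numeric
            out.append(f'{col}:decimal(38,30)')
        elif scale == '0':                           # zero scale: integral
            out.append(f'{col}:bigint')
        else:
            out.append(f'{col}:decimal({precision},{scale})')
    return '|'.join(out)

cdt = ("header:integer|releaseNumber:numeric|amountCredit:numeric|"
       "lastUpdatedBy:numeric(15,10)|orderNumber:numeric(20,0)")
print(convert(cdt))
```

In Python an unmatched optional group is reported as None, which plays the role of the null check done with Option(b).isEmpty in the Scala version.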

Extract table key-values from LUA code

I have multiple strings from Lua code, each one with a Lua table item, something like:
atable['akeyofthetable'] = { 'name' = 'a name', 'thevalue' = 34, 'anotherkey' = 'something' }
The string might be spanned in multiple lines, meaning it might be:
atable['akeyofthetable'] = { 'name' = 'a name',
'thevalue' = 34,
"anotherkey" = 'something' }
How can I get some of the keys (e.g. only name and anotherkey in the above example) with their values as "re.match" objects in Python 3 from that string? Because this is taken from code, the existence of keys is not guaranteed, the quoting of keys and values (double or single quotes) may vary, even from key to key, and there may be empty values ('name' = '') or unquoted strings as values ('thevalue' = anonquotedstringasvalue). Even the order of the keys is not guaranteed. Splitting on commas (,) does not work because some string values contain commas (e.g. 'anotherkey' = 'my beloved, strange, value' or even 'anotherkey' = "my beloved, 'strange' = 34, value"). Keys also may or may not be quoted (names that are plain ASCII will probably not be quoted).
Is it possible to do this with one regex, or must I do a separate search for every key needed?
Code
If there is a possibility of escaped quotes \' or \" within the string, you can replace the respective capture groups with '((?:[^'\\]|\\.)*)' (and the double-quoted equivalent).
['\"](?:name|anotherkey)['\"]\s*=\s*(?:'([^']*)'|\"([^\"]*)\")
Usage
import re
keys = [
"name",
"anotherkey"
]
r = r"['\"](" + "|".join([re.escape(key) for key in keys]) + r")['\"]\s*=\s*(?:'([^']*)'|\"([^\"]*)\")"
s = "atable['akeyofthetable'] = { 'name' = 'a name',\n\t 'thevalue' = 34, \n\t \"anotherkey\" = 'something' }"
print(re.findall(r, s))
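The escaped-quote variant mentioned at the top of this answer can be sketched as below. This is an assumption-laden illustration, not the answerer's code: the name find_values is hypothetical, keys are assumed to be quoted, and the captured values keep their backslashes rather than being unescaped.

```python
import re

# Quoted-string bodies that tolerate backslash-escaped characters.
SINGLE = r"'((?:[^'\\]|\\.)*)'"
DOUBLE = r'"((?:[^"\\]|\\.)*)"'
QUOTE = "['\"]"

def find_values(source, keys):
    """Map each requested key to its quoted value (escapes left as written)."""
    key_alt = "|".join(re.escape(k) for k in keys)
    pattern = re.compile(
        QUOTE + "(" + key_alt + ")" + QUOTE
        + r"\s*=\s*(?:" + SINGLE + "|" + DOUBLE + ")"
    )
    result = {}
    for key, single, double in pattern.findall(source):
        # Exactly one alternative took part in the match; the other is ''.
        result[key] = single or double
    return result

s = r"""atable['k'] = { 'name' = 'it\'s a name', "anotherkey" = "a, \"b\", c" }"""
print(find_values(s, ["name", "anotherkey"]))
```

Unquoted values such as 34 are still skipped, exactly as in the pattern above; handling them would need a third alternative.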
Explanation
In the code, the second item below is built by joining the entries of the keys list.
['\"] Match any character in the set '"
(name|anotherkey) Capture the key into capture group 1
['\"] Match any character in the set '"
\s* Match any number of whitespace characters
= Match this literally
\s* Match any number of whitespace characters
(?:'([^']*)'|\"([^\"]*)\") Match either of the following
'([^']*)' Match ', followed by any character except ' any number of times, followed by '
\"([^\"]*)\" Match ", followed by any character except " any number of times, followed by "