Regex - capture multiple groups and combine them multiple times in one string - regex

I need to combine some text using regex, but I'm having a bit of trouble capturing and substituting within my string. For example, I need to capture the digits at the start and append them, in a substitution, to every section enclosed between ||
I have:
||10||a||ab||abc||
I want:
||10||a10||ab10||abc10||
So I need '10' in capture group 1 and 'a|ab|abc' in capture group 2
I've tried something like this, but it doesn't work for me (captures only one [a-z] group)
(?=.*\|\|(\d+)\|\|)(?=.*\b([a-z]+\b))

I would achieve this without a complex regular expression. For example, you could do this:
input = "||10||a||ab||abc||"
parts = input.scan(/\w+/) # => ["10", "a", "ab", "abc"]
parts[1..-1].each { |part| part << parts[0] } # => ["a10", "ab10", "abc10"]
"||#{parts.join('||')}||"

str = "||10||a||ab||abc||"
first = nil
str.gsub(/(?<=\|\|)[^\|]+/) { |s| first.nil? ? (first = s) : s + first }
#=> "||10||a10||ab10||abc10||"
The regular expression reads, "match one or more characters other than a pipe, immediately following two pipes" ((?<=\|\|) being a positive lookbehind).
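For comparison, the same idea can be sketched in JavaScript: capture the leading number first, then append it to every later segment. The function name appendFirstToRest is made up for illustration, and the sketch assumes the input always has the ||10||a||ab|| shape shown in the question.

```javascript
// Hypothetical helper: append the first ||-delimited number to every later field.
function appendFirstToRest(input) {
  // Capture the leading digits, e.g. "10" in "||10||a||ab||abc||".
  const head = input.match(/^\|\|(\d+)\|\|/);
  if (head === null) return input; // no leading number: leave the string unchanged
  const first = head[1];
  const prefix = head[0];                  // "||10||"
  const rest = input.slice(prefix.length); // "a||ab||abc||"
  // Append the number to every run of non-pipe characters in the remainder.
  return prefix + rest.replace(/[^|]+/g, (segment) => segment + first);
}

console.log(appendFirstToRest("||10||a||ab||abc||"));
// ||10||a10||ab10||abc10||
```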


Regex to match a Number between two strings [duplicate]

I have a string that looks like the following:
<#399969178745962506> hello to <#!104729417217032192>
I have a dictionary containing both that looks like following:
{"399969178745962506", "One"},
{"104729417217032192", "Two"}
My goal here is to replace the <#399969178745962506> into the value of that number key, which in this case would be One
Regex.Replace(arg.Content, "(?<=<)(.*?)(?=>)", m => userDic.ContainsKey(m.Value) ? userDic[m.Value] : m.Value);
My current regex is as following: (?<=<)(.*?)(?=>) which only matches everything in between < and > which would in this case leave both #399969178745962506 and #!104729417217032192
I can't just ignore the # sign, because the ! sign is not there every time. So it could be optimal to only get numbers with something like \d+
I need to figure out how to only get the numbers between < and > but I can't for the life of me figure out how.
Very grateful for any help!
In C#, you may use two approaches: a lookaround-based one (possible because .NET lookbehind patterns can be variable-width) and a capturing-group one.
Lookaround based approach
The pattern that will easily help you get the digits in the right context is
(?<=<#!?)\d+(?=>)
See the regex demo
The (?<=<#!?) is a positive lookbehind that requires <# or <#! immediately to the left of the current location, and (?=>) is a positive lookahead that requires a > char immediately to the right of the current location.
Capturing approach
You may use the following pattern that will capture the digits inside the expected <...> substrings:
<#!?(\d+)>
Details
<# - a literal <# substring
!? - an optional exclamation sign
(\d+) - capturing group 1 that matches one or more digits
> - a literal > sign.
Note that the values you need can be accessed via match.Groups[1].Value, as shown in the snippet below.
Usage:
var userDic = new Dictionary<string, string> {
{"399969178745962506", "One"},
{"104729417217032192", "Two"}
};
var p = @"<#!?(\d+)>";
var s = "<#399969178745962506> hello to <#!104729417217032192>";
Console.WriteLine(
Regex.Replace(s, p, m => userDic.ContainsKey(m.Groups[1].Value) ?
userDic[m.Groups[1].Value] : m.Value
)
); // => One hello to Two
// Or, if you need to keep <#, <#! and >
Console.WriteLine(
Regex.Replace(s, @"(<#!?)(\d+)>", m => userDic.ContainsKey(m.Groups[2].Value) ?
$"{m.Groups[1].Value}{userDic[m.Groups[2].Value]}>" : m.Value
)
); // => <#One> hello to <#!Two>
See the C# demo.
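The capturing-group approach is not specific to C#. Below is a minimal JavaScript sketch of the same replacement, with a Map standing in for the dictionary. The helper name replaceMentions is illustrative and does not come from the original answer.

```javascript
// Map standing in for the C# dictionary from the question.
const userDic = new Map([
  ["399969178745962506", "One"],
  ["104729417217032192", "Two"],
]);

// Hypothetical helper: swap each <#id> or <#!id> for its mapped name.
function replaceMentions(text, names) {
  // Group 1 holds only the digits; "!?" makes the exclamation mark optional.
  return text.replace(/<#!?(\d+)>/g, (whole, id) =>
    names.has(id) ? names.get(id) : whole // unknown ids are left untouched
  );
}

console.log(replaceMentions("<#399969178745962506> hello to <#!104729417217032192>", userDic));
// One hello to Two
```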
To extract just the numbers from your given format, use this regex pattern:
(?<=<#|<#!)(\d+)(?=>)
See it work in action: https://regexr.com/3j6ia
You can use a non-capturing group to keep parts of the pattern out of the capturing group:
(?<=<)(?:#?!?)(.*?)(?=>)
Alternatively, you could name the inner group and use the named group to get it:
(?<=<)(?:#?!?)(?<yourgroupname>.*?)(?=>)
Access it via m.Groups["yourgroupname"].Value. For more, see e.g. How do I access named capturing groups in a .NET Regex?
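Named groups work much the same way in JavaScript, where they are read from match.groups. The sketch below is an illustration only: the group name id is made up, and \d+ is used in place of .*? on the assumption that only digits are wanted.

```javascript
// Named group "id" (an illustrative name) holds the digits between <# or <#! and >.
const mention = /(?<=<)(?:#?!?)(?<id>\d+)(?=>)/g;

const ids = Array.from(
  "<#399969178745962506> hello to <#!104729417217032192>".matchAll(mention),
  (m) => m.groups.id // read the capture by name instead of by index
);

console.log(ids.join(" "));
// 399969178745962506 104729417217032192
```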
Regex: (?:<#!?(\d+)>)
Details:
(?:...) - a non-capturing group (redundant here, since it wraps the whole pattern)
<# - matches the characters <# literally
!? - matches ! zero or one time
(\d+) - capturing group 1: \d+ matches one or more digits
> - matches a literal >
Regex demo
string text = "<#399969178745962506> hello to <#!104729417217032192>";
Dictionary<string, string> list = new Dictionary<string, string>() { { "399969178745962506", "One" }, { "104729417217032192", "Two" } };
text = Regex.Replace(text, @"(?:<#!?(\d+)>)", m => list.ContainsKey(m.Groups[1].Value) ? list[m.Groups[1].Value] : m.Value);
Console.WriteLine(text); // One hello to Two
Console.ReadLine();

Regex - number of characters for sequence

I have the following pattern:
<tag-2>B1</tag-2>
<tag-3>A12</tag-3>
<tag-4>M123</tag-4>
//etc
There is always one letter and digits.
I need to create a regex which uses number from the tag and applies it to the sequence between tags. I know that I can use a backreference but I don't know how to construct a regex. Here is incomplete regex:
"^<tag-([2-9])>[A-Z][0-9]/*how to apply here number from the tag ?*/</tag-\\1>$"
Edit
The following strings are not matched:
<tag-2>11</tag-2> //missing letter
<tag-2>BB</tag-2> // missing digit
<tag-3>B123</tag-3> //too many digits
<tag-3>AA1</tag-3> //should be only one letter and two digits
<tag-4>N12</tag-4> //too few digits
Regular expressions cannot contain elements that are functions of the values of back-references (other than the back-references themselves). That's because regular expressions are static from the time they are constructed.
One could, however, extract the desired string, or conclude that the string contains no valid substring, in two steps. First attempt to match the string against /<tag-(\d+)>/, where the contents of the capture group, after being converted to an integer, equals the required length of the string that begins with a capital letter and is followed by digits. That information can then be used to construct a second regular expression that is used to verify the remainder of the string and extract the desired substring.
I will use Ruby to illustrate how that might be done here. The operations--and certainly the two regular expressions--should be clear even to readers who are not familiar with Ruby.
Code
R = /<tag-(\d+)>/ # a constant
def doit(str)
m = str.match(R) # obtain a MatchData object; else nil
return nil if m.nil? # finished if no match
n = m[1].to_i-1 # required number of digits
r = /\A\p{Lu}\d{#{n}}(?=<\/tag-#{m[1]}>)/
# regular expression for second match
str[m.end(0).to_i..-1][r] # extract the desired string; else nil
end
Examples
arr = <<_.each_line.map(&:chomp)
<tag-2>B1</tag-2>
<tag-3>A12</tag-3>
<tag-4>M123</tag-4>
<tag-2>11</tag-2>
<tag-2>BB</tag-2>
<tag-3>B123</tag-3>
<tag-3>AA1</tag-3>
<tag-4>N12</tag-4>
_
#=> ["<tag-2>B1</tag-2>", "<tag-3>A12</tag-3>",
# "<tag-4>M123</tag-4>", "<tag-2>11</tag-2>",
# "<tag-2>BB</tag-2>", "<tag-3>B123</tag-3>",
# "<tag-3>AA1</tag-3>", "<tag-4>N12</tag-4>"]
arr.map do |line|
s = doit(line)
s = 'nil' if s.nil?
puts "#{line.ljust(22)}: #{s}"
end
<tag-2>B1</tag-2> : B1
<tag-3>A12</tag-3> : A12
<tag-4>M123</tag-4> : M123
<tag-2>11</tag-2> : nil
<tag-2>BB</tag-2> : nil
<tag-3>B123</tag-3> : nil
<tag-3>AA1</tag-3> : nil
<tag-4>N12</tag-4> : nil
Explanation
Note that (?=<\/tag-#{m[1]}>) (part of r in the body of the method) is a positive lookahead, meaning that "<\/tag-#{m[1]}>" (with #{m[1]} substituted out) must be matched, but is not part of the match that is returned.
The step-by-step calculations are as follows.
str = "<tag-2>B1</tag-2>"
m = str.match(R)
#=> #<MatchData "<tag-2>" 1:"2">
m[0]
#=> "<tag-2>" (match)
m[1]
#=> "2" (contents of capture group 1)
m.end(0)
#=> 7 (index of str where the match ends, plus 1)
m.nil?
#=> false (do not return)
n = m[1].to_i-1
#=> 1 (number of digits required)
r = /\A\p{Lu}\d{#{n}}(?=\<\/tag\-#{m[1]}\>)/
#=> /\A\p{Lu}\d{1}(?=\<\/tag\-2\>)/
s = str[m.end(0).to_i..-1]
#=> str[7..-1]
#=> "B1</tag-2>"
s[r]
#=> "B1"
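The same two-step idea carries over to other languages. Here is a JavaScript sketch under a few assumptions: extractTagged is a made-up name, only uppercase ASCII letters are accepted (the Ruby version uses \p{Lu}), and the whole line is anchored as in the question's ^...$ pattern.

```javascript
// Hypothetical helper: return the tag body if it is one capital letter
// followed by (tag number - 1) digits, otherwise null.
function extractTagged(line) {
  // Step 1: read the number from the opening tag.
  const open = line.match(/^<tag-(\d+)>/);
  if (open === null) return null;
  const digits = Number(open[1]) - 1; // required number of digits
  if (digits < 1) return null;        // assumption: at least one digit is required
  // Step 2: build a second pattern from that number and test the remainder.
  const body = new RegExp(`^([A-Z]\\d{${digits}})</tag-${open[1]}>$`);
  const rest = line.slice(open[0].length);
  const m = rest.match(body);
  return m === null ? null : m[1];
}

console.log(extractTagged("<tag-3>A12</tag-3>"));  // A12
console.log(extractTagged("<tag-3>B123</tag-3>")); // null
```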
It looks like you're trying to create a pattern that will interpret a number in order to determine how long a string should be. I don't know of any feature to automate this process in any regular expression engine, but it can be done in a more manual fashion by enumerating all cases which you wish to handle.
For example, tags 2 through 9 can be handled as such:
<tag-2>: ^<tag-2>[A-Z][0-9]</tag-2>$
<tag-3>: ^<tag-3>[A-Z][0-9]{2}</tag-3>$
<tag-4>: ^<tag-4>[A-Z][0-9]{3}</tag-4>$
<tag-5>: ^<tag-5>[A-Z][0-9]{4}</tag-5>$
<tag-6>: ^<tag-6>[A-Z][0-9]{5}</tag-6>$
<tag-7>: ^<tag-7>[A-Z][0-9]{6}</tag-7>$
<tag-8>: ^<tag-8>[A-Z][0-9]{7}</tag-8>$
<tag-9>: ^<tag-9>[A-Z][0-9]{8}</tag-9>$
By removing the grouping and back-references you eliminate some complications that can occur when trying to combine regular expression patterns and can produce the following:
^(<tag-2>[A-Z][0-9]</tag-2>|<tag-3>[A-Z][0-9]{2}</tag-3>|<tag-4>[A-Z][0-9]{3}</tag-4>|<tag-5>[A-Z][0-9]{4}</tag-5>|<tag-6>[A-Z][0-9]{5}</tag-6>|<tag-7>[A-Z][0-9]{6}</tag-7>|<tag-8>[A-Z][0-9]{7}</tag-8>|<tag-9>[A-Z][0-9]{8}</tag-9>)$
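Writing the alternation by hand is error-prone, so it can also be generated. The JavaScript sketch below (buildTagPattern is an illustrative name) builds an equivalent pattern for any range of tag numbers; it writes [0-9]{1} for tag 2, which matches the same strings as the bare [0-9] above.

```javascript
// Hypothetical helper: build one alternative per tag number in [from, to].
function buildTagPattern(from, to) {
  const alternatives = [];
  for (let n = from; n <= to; n++) {
    // Tag n wraps one capital letter followed by n - 1 digits.
    alternatives.push(`<tag-${n}>[A-Z][0-9]{${n - 1}}</tag-${n}>`);
  }
  return new RegExp(`^(?:${alternatives.join("|")})$`);
}

const tagPattern = buildTagPattern(2, 9);
console.log(tagPattern.test("<tag-3>A12</tag-3>"));  // true
console.log(tagPattern.test("<tag-3>B123</tag-3>")); // false
```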

How to replace part of string using regex pattern matching in scala?

I have a String which contains column names and datatypes as below:
val cdt = "header:integer|releaseNumber:numeric|amountCredit:numeric|lastUpdatedBy:numeric(15,10)|orderNumber:numeric(20,0)"
My requirement is to convert the postgres datatypes which are present as numeric, numeric(15,10) into spark-sql compatible datatypes.
In this case,
numeric -> decimal(38,30)
numeric(15,10) -> decimal(15,10)
numeric(20,0) -> bigint (this is an integral datatype, since its scale is zero)
In order to access the datatype in the string: cdt, I split it and created a Seq from it.
val dt = cdt.split("\\|").toSeq
Now I have a Seq of elements in which each element is a String in the below format:
Seq("header:integer", "releaseNumber:numeric","amountCredit:numeric","lastUpdatedBy:numeric(15,10)","orderNumber:numeric(20,0)")
I have the pattern matching regex """numeric\(\d+,(\d+)\)""".r for numeric(precision, scale), which only works if there is a scale of two digits, e.g. numeric(20,23).
I am very new to regex and Scala, and I don't understand how to create regex patterns for the remaining two cases and apply them to a string to match a condition. I tried it in the way below, but it gives me a compilation error: "Cannot resolve symbol findFirstMatchIn"
dt.map(e => e.split("\\:")).map(e => changeDataType(e(0), e(1)))
def changeDataType(colName: String, cd:String): String = {
val finalColumns = new String
val pattern1 = """numeric\(\d+,(\d+)\)""".r
cd match {
case pattern1.findFirstMatchIn(dt) =>
}
}
I am trying to get the final output into a String as below:
header:integer|releaseNumber:decimal(38,30)|amountCredit:decimal(38,30)|lastUpdatedBy:decimal(15,10)|orderNumber:bigint
How can I use multiple regex patterns for the different cases to check the datatype of each value in the Seq and change it to the suitable datatype as mentioned above?
Could anyone let me know how can I achieve it ?
It can be done with a single regex pattern, but some testing of the match results is required.
val numericRE = raw"([^:]+):numeric(?:\((\d+),(\d+)\))?".r
cdt.split("\\|")
.map{
case numericRE(col,a,b) =>
if (Option(b).isEmpty) s"$col:decimal(38,30)"
else if (b == "0") s"$col:bigint"
else s"$col:decimal($a,$b)"
case x => x //pass-through
}.mkString("|")
//res0: String = header:integer|releaseNumber:decimal(38,30)|amountCredit:decimal(38,30)|lastUpdatedBy:decimal(15,10)|orderNumber:bigint
Of course it can be done with three different regex patterns, but I think this is pretty clear.
explanation
raw - don't need so many escape characters - \
([^:]+) - capture everything up to the 1st colon
:numeric - followed by the string ":numeric"
(?: - start a non-capture group
\((\d+),(\d+)\) - capture the 2 digit strings, separated by a comma, inside parentheses
)? - the non-capture group is optional
numericRE(col,a,b) - col is the 1st capture group, a and b are the digit captures, but they are inside the optional non-capture group so they might be null
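For readers outside Scala, the same single-pattern idea can be sketched in JavaScript. An unmatched optional group comes back as undefined there, which plays the role of the null check above. The function name toSparkTypes is made up for this sketch, and the pattern is anchored with ^ and $ to mirror the whole-string match that a Scala regex extractor performs.

```javascript
// Hypothetical helper: rewrite Postgres numeric types as Spark SQL types.
function toSparkTypes(cdt) {
  // Groups: 1 = column name, 2 = precision, 3 = scale (2 and 3 are optional).
  const numericRE = /^([^:]+):numeric(?:\((\d+),(\d+)\))?$/;
  return cdt
    .split("|")
    .map((field) => {
      const m = field.match(numericRE);
      if (m === null) return field; // not a numeric column: pass through
      const [, col, precision, scale] = m;
      if (scale === undefined) return `${col}:decimal(38,30)`; // bare "numeric"
      if (scale === "0") return `${col}:bigint`;               // scale of zero
      return `${col}:decimal(${precision},${scale})`;
    })
    .join("|");
}

console.log(toSparkTypes("header:integer|amountCredit:numeric|orderNumber:numeric(20,0)"));
// header:integer|amountCredit:decimal(38,30)|orderNumber:bigint
```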

Compound Words - Regex [duplicate]

I would expect this line of JavaScript:
"foo bar baz".match(/^(\s*\w+)+$/)
to return something like:
["foo bar baz", "foo", " bar", " baz"]
but instead it returns only the last captured match:
["foo bar baz", " baz"]
Is there a way to get all the captured matches?
When you repeat a capturing group, in most flavors, only the last capture is kept; any previous capture is overwritten. In some flavors, e.g. .NET, you can get all intermediate captures, but this is not the case with JavaScript.
That is, in Javascript, if you have a pattern with N capturing groups, you can only capture exactly N strings per match, even if some of those groups were repeated.
So generally speaking, depending on what you need to do:
If it's an option, split on delimiters instead
Instead of matching /(pattern)+/, maybe match /pattern/g, perhaps in an exec loop
Do note that these two aren't exactly equivalent, but it may be an option
Do multilevel matching:
Capture the repeated group in one match
Then run another regex to break that match apart
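Applied to the string from the question, the last two options might look like this in JavaScript. This is only a sketch; matchAll requires a reasonably modern engine, and the variable names are illustrative.

```javascript
const input = "foo bar baz";

// Option: match one word at a time with the g flag instead of repeating the group.
const words = Array.from(input.matchAll(/\s*\w+/g), (m) => m[0]);
// words is ["foo", " bar", " baz"]

// Option: multilevel matching. First validate and capture the whole run,
// then break that match apart with a second regex.
const whole = input.match(/^(?:\s*\w+)+$/);
const pieces = whole === null ? [] : whole[0].match(/\s*\w+/g);
const combined = [input, ...pieces];
// combined is ["foo bar baz", "foo", " bar", " baz"]

console.log(JSON.stringify(combined));
```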
References
regular-expressions.info/Repeating a Capturing Group vs Capturing a Repeating Group
Javascript flavor notes
Example
Here's an example of matching <some;words;here> in a text, using an exec loop, and then splitting on ; to get individual words (see also on ideone.com):
var text = "a;b;<c;d;e;f>;g;h;i;<no no no>;j;k;<xx;yy;zz>";
var r = /<(\w+(;\w+)*)>/g;
var match;
while ((match = r.exec(text)) != null) {
console.log(match[1].split(";").join(","));
}
// c,d,e,f
// xx,yy,zz
The pattern used is:
      _2__
     /    \
<(\w+(;\w+)*)>
 \__________/
      1
This matches <word>, <word;another>, <word;another;please>, etc. Group 2 is repeated to capture any number of words, but it can only keep the last capture. The entire list of words is captured by group 1; this string is then split on the semicolon delimiter.
Related questions
How do you access the matched groups in a javascript regex?
How about this? "foo bar baz".match(/(\w+)+/g)
Unless you have a more complicated requirement for how you're splitting your strings, you can split them, and then return the initial string with them:
var data = "foo bar baz";
var pieces = data.split(' ');
pieces.unshift(data);
try using 'g':
"foo bar baz".match(/\w+/g)
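You can drop the repetition and match each word separately.
So, instead of repeating the group with + (which keeps only the last capture), make the group optional with ? and apply the pattern globally. Note that ? after a group is the optional quantifier, not a lazy modifier.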
REGEX: (\s*\w+)?
RESULT:
Match 1: foo
Match 2: bar
Match 3: baz

Overlapping matches in Regex - Scala

I'm trying to extract all possible combinations of 3 letters from a String following the pattern XYX.
val text = "abaca dedfd ghgig"
val p = """([a-z])(?!\1)[a-z]\1""".r
p.findAllIn(text).toArray
When I run the script I get:
aba, ded, ghg
And it should be:
aba, aca, ded, dfd, ghg, gig
It does not detect overlapped combinations.
The way to do this is to enclose the whole pattern in a lookahead, so that each match consumes no characters and only marks a start position:
val p = """(?=(([a-z])(?!\2)[a-z]\2))""".r
p.findAllIn(text).matchData foreach {
m => println(m.group(1))
}
The lookahead is only an assertion (a test) for the current position and the pattern inside doesn't consume characters. The result you are looking for is in the first capture group (that is needed to get the result since the whole match is empty).
You need to capture the whole pattern and put it inside a positive lookahead. The code in Scala will be the following:
object Main extends App {
val text = "abaca dedfd ghgig"
val p = """(?=(([a-z])(?!\2)[a-z]\2))""".r
val allMatches = p.findAllMatchIn(text).map(_.group(1))
println(allMatches.mkString(", "))
// => aba, aca, ded, dfd, ghg, gig
}
See the online Scala demo
Note that the backreference becomes \2, since the letter to check is now in group 2, while group 1 contains the value you need to collect.
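The lookahead trick is not Scala-specific. Here is a JavaScript sketch of the same pattern (findOverlapping is an illustrative name); matchAll advances past each empty match on its own, so no manual index handling is needed.

```javascript
// Hypothetical helper: collect every XYX triple, including overlapping ones.
function findOverlapping(text) {
  // The lookahead consumes nothing, so the next attempt starts one character later.
  // Group 1 is the whole triple, group 2 is its first letter.
  const p = /(?=(([a-z])(?!\2)[a-z]\2))/g;
  return Array.from(text.matchAll(p), (m) => m[1]);
}

console.log(findOverlapping("abaca dedfd ghgig").join(", "));
// aba, aca, ded, dfd, ghg, gig
```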