How to replace part of string using regex pattern matching in scala? - regex

I have a String which contains column names and datatypes as below:
val cdt = "header:integer|releaseNumber:numeric|amountCredit:numeric|lastUpdatedBy:numeric(15,10)|orderNumber:numeric(20,0)"
My requirement is to convert the postgres datatypes which are present as numeric, numeric(15,10) into spark-sql compatible datatypes.
In this case,
numeric -> decimal(38,30)
numeric(15,10) -> decimal(15,10)
numeric(20,0) -> bigint (This is an integeral datatype as there its precision is zero.)
In order to access the datatype in the string: cdt, I split it and created a Seq from it.
val dt = cdt.split("\\|").toSeq
Now I have a Seq of elements in which each element is a String in the below format:
Seq("header:integer", "releaseNumber:numeric","amountCredit:numeric","lastUpdatedBy:numeric(15,10)","orderNumber:numeric(20,0)")
I have the pattern matching regex: """numeric\(\d+,(\d+)\)""".r, for numeric(precision, scale) which only works if there is a
scale of two digits, ex: numeric(20,23).
I am very new to REGEX and Scala & I don't understand how to create regex pattterns for the remaining two cases & apply it on a string to match a condition. I tried it in the below way but it gives me a compilation error: "Cannot resolve symbol findFirstMatchIn"
dt.map(e => e.split("\\:")).map(e => changeDataType(e(0), e(1)))
def changeDataType(colName: String, cd:String): String = {
val finalColumns = new String
val pattern1 = """numeric\(\d+,(\d+)\)""".r
cd match {
case pattern1.findFirstMatchIn(dt) =>
}
}
I am trying to get the final output into a String as below:
header:integer|releaseNumber:decimal(38,30)|amountCredit:decimal(38,30)|lastUpdatedBy:decimal(15,10)|orderNumber:bigint
How to multiple regex patterns for different cases to check/apply pattern matching on datatype of each value in the seq and change it to my suitable datatype as mentioned above.
Could anyone let me know how can I achieve it ?

It can be done with a single regex pattern, but some testing of the match results is required.
val numericRE = raw"([^:]+):numeric(?:\((\d+),(\d+)\))?".r
cdt.split("\\|")
.map{
case numericRE(col,a,b) =>
if (Option(b).isEmpty) s"$col:decimal(38,30)"
else if (b == "0") s"$col:bigint"
else s"$col:decimal($a,$b)"
case x => x //pass-through
}.mkString("|")
//res0: String = header:integer|releaseNumber:decimal(38,30)|amountCredit:decimal(38,30)|lastUpdatedBy:decimal(15,10)|orderNumber:bigint
Of course it can be done with three different regex patterns, but I think this is pretty clear.
explanation
raw - don't need so many escape characters - \
([^:]+) - capture everything up to the 1st colon
:numeric - followed by the string ":numeric"
(?: - start a non-capture group
\((\d+),(\d+)\) - capture the 2 digit strings, separated by a comma, inside parentheses
)? - the non-capture group is optional
numericRE(col,a,b) - col is the 1st capture group, a and b are the digit captures, but they are inside the optional non-capture group so they might be null

Related

Regex split string by two consecutive pipe ||

I want to split below string by two pipe(|| ) regex .
Input String
value1=data1||value2=da|ta2||value3=test&user01|
Expected Output
value1=data1
value2=da|ta2
value3=test&user01|
I tried ([^||]+) but its consider single pipe | also to split .
Try out my example - Regex
value2 has single pipe it should not be considered as matching.
I am using lua script like
for pair in string.gmatch(params, "([^||]+)") do
print(pair)
end
You can explicitly find each ||.
$ cat foo.lua
s = 'value1=data1||value2=da|ta2||value3=test&user01|'
offset = 1
for idx in string.gmatch(s, '()||') do
print(string.sub(s, offset, idx - 1) )
offset = idx + 2
end
-- Deal with the part after the right-most `||`.
-- Must +1 or it'll fail to handle s like "a=b||".
if offset <= #s + 1 then
print(string.sub(s, offset) )
end
$ lua foo.lua
value1=data1
value2=da|ta2
value3=test&user01|
Regarding ()|| see Lua's doc about Patterns (Lua does not have regex support) —
Captures:
A pattern can contain sub-patterns enclosed in parentheses; they describe captures. When a match succeeds, the substrings of the subject string that match captures are stored (captured) for future use. Captures are numbered according to their left parentheses. For instance, in the pattern "(a*(.)%w(%s*))", the part of the string matching "a*(.)%w(%s*)" is stored as the first capture, and therefore has number 1; the character matching "." is captured with number 2, and the part matching "%s*" has number 3.
As a special case, the capture () captures the current string position (a number). For instance, if we apply the pattern "()aa()" on the string "flaaap", there will be two captures: 3 and 5.
the easiest way is to replace the sequence of 2 characters || with any other character (e.g. ;) that will not be used in the data, and only then use it as a separator:
local params = "value1=data1||value2=da|ta2||value3=test&user01|"
for pair in string.gmatch(params:gsub('||',';'), "([^;]+)") do
print(pair)
end
if all characters are possible, then any non-printable characters can be used, according to their codes: string.char("10") == "\10" == "\n"
even with code 1: "\1"
string.gmatch( params:gsub('||','\1'), "([^\1]+)" )

Regex to match a Number between two strings [duplicate]

I have a string that looks like the following:
<#399969178745962506> hello to <#!104729417217032192>
I have a dictionary containing both that looks like following:
{"399969178745962506", "One"},
{"104729417217032192", "Two"}
My goal here is to replace the <#399969178745962506> into the value of that number key, which in this case would be One
Regex.Replace(arg.Content, "(?<=<)(.*?)(?=>)", m => userDic.ContainsKey(m.Value) ? userDic[m.Value] : m.Value);
My current regex is as following: (?<=<)(.*?)(?=>) which only matches everything in between < and > which would in this case leave both #399969178745962506 and #!104729417217032192
I can't just ignore the # sign, because the ! sign is not there every time. So it could be optimal to only get numbers with something like \d+
I need to figure out how to only get the numbers between < and > but I can't for the life of me figure out how.
Very grateful for any help!
In C#, you may use 2 approaches: a lookaround based on (since lookbehind patterns can be variable width) and a capturing group approach.
Lookaround based approach
The pattern that will easily help you get the digits in the right context is
(?<=<#!?)\d+(?=>)
See the regex demo
The (?<=<#!?) is a positive lookbehind that requires <= or <=! immediately to the left of the current location and (?=>) is a positive lookahead that requires > char immediately to the right of the current location.
Capturing approach
You may use the following pattern that will capture the digits inside the expected <...> substrings:
<#!?(\d+)>
Details
<# - a literal <# substring
!? - an optional exclamation sign
(\d+) - capturing group 1 that matches one or more digits
> - a literal > sign.
Note that the values you need can be accessed via match.Groups[1].Value as shown in the snippet above.
Usage:
var userDic = new Dictionary<string, string> {
{"399969178745962506", "One"},
{"104729417217032192", "Two"}
};
var p = #"<#!?(\d+)>";
var s = "<#399969178745962506> hello to <#!104729417217032192>";
Console.WriteLine(
Regex.Replace(s, p, m => userDic.ContainsKey(m.Groups[1].Value) ?
userDic[m.Groups[1].Value] : m.Value
)
); // => One hello to Two
// Or, if you need to keep <#, <#! and >
Console.WriteLine(
Regex.Replace(s, #"(<#!?)(\d+)>", m => userDic.ContainsKey(m.Groups[2].Value) ?
$"{m.Groups[1].Value}{userDic[m.Groups[2].Value]}>" : m.Value
)
); // => <#One> hello to <#!Two>
See the C# demo.
To extract just the numbers from you're given format, use this regex pattern:
(?<=<#|<#!)(\d+)(?=>)
See it work in action: https://regexr.com/3j6ia
You can use non-capturing groups to exclude parts of the needed pattern to be inside the group:
(?<=<)(?:#?!?)(.*?)(?=>)
alternativly you could name the inner group and use the named group to get it:
(?<=<)(?:#?!?)(?<yourgroupname>.*?)(?=>)
Access it via m.Groups["yourgroupname"].Value - more see f.e. How do I access named capturing groups in a .NET Regex?
Regex: (?:<#!?(\d+)>)
Details:
(?:) Non-capturing group
<# matches the characters <# literally
? Matches between zero and one times
(\d+) 1st Capturing Group \d+ matches a digit (equal to [0-9])
Regex demo
string text = "<#399969178745962506> hello to <#!104729417217032192>";
Dictionary<string, string> list = new Dictionary<string, string>() { { "399969178745962506", "One" }, { "104729417217032192", "Two" } };
text = Regex.Replace(text, #"(?:<#!?(\d+)>)", m => list.ContainsKey(m.Groups[1].Value) ? list[m.Groups[1].Value] : m.Value);
Console.WriteLine(text); \\ One hello to Two
Console.ReadLine();

Regex - capture multiple groups and combine them multiple times in one string

I need to combine some text using regex, but I'm having a bit of trouble when trying to capture and substitute my string. For example - I need to capture digits from the start, and add them in a substitution to every section closed between ||
I have:
||10||a||ab||abc||
I want:
||10||a10||ab10||abc10||
So I need '10' in capture group 1 and 'a|ab|abc' in capture group 2
I've tried something like this, but it doesn't work for me (captures only one [a-z] group)
(?=.*\|\|(\d+)\|\|)(?=.*\b([a-z]+\b))
I would achieve this without a complex regular expression. For example, you could do this:
input = "||10||a||ab||abc||"
parts = input.scan(/\w+/) # => ["10", "a", "ab", "abc"]
parts[1..-1].each { |part| part << parts[0] } # => ["a10", "ab10", "abc10"]
"||#{parts.join('||')}||"
str = "||10||a||ab||abc||"
first = nil
str.gsub(/(?<=\|\|)[^\|]+/) { |s| first.nil? ? (first = s) : s + first }
#=> "||10||a10||ab10||abc10||"
The regular expression reads, "match one or more characters in a pipe immediately following two pipes" ((?<=\|\|) being a positive lookbehind).

How to create "blocks" with Regex

For a project of mine, I want to create 'blocks' with Regex.
\xyz\yzx //wrong format
x\12 //wrong format
12\x //wrong format
\x12\x13\x14\x00\xff\xff //correct format
When using Regex101 to test my regular expressions, I came to this result:
([\\x(0-9A-Fa-f)])/gm
This leads to an incorrect output, because
12\x
Still gets detected as a correct string, though the order is wrong, it needs to be in the order specified below, and in no other order.
backslash x 0-9A-Fa-f 0-9A-Fa-f
Can anyone explain how that works and why it works in that way? Thanks in advance!
To match the \, folloed with x, followed with 2 hex chars, anywhere in the string, you need to use
\\x[0-9A-Fa-f]{2}
See the regex demo
To force it match all non-overlapping occurrences, use the specific modifiers (like /g in JavaScript/Perl) or specific functions in your programming language (Regex.Matches in .NET, or preg_match_all in PHP, etc.).
The ^(?:\\x[0-9A-Fa-f]{2})+$ regex validates a whole string that consists of the patterns like above. It happens due to the ^ (start of string) and $ (end of string) anchors. Note the (?:...)+ is a non-capturing group that can repeat in the string 1 or more times (due to + quantifier).
Some Java demo:
String s = "\\x12\\x13\\x14\\x00\\xff\\xff";
// Extract valid blocks
Pattern pattern = Pattern.compile("\\\\x[0-9A-Fa-f]{2}");
Matcher matcher = pattern.matcher(s);
List<String> res = new ArrayList<>();
while (matcher.find()){
res.add(matcher.group(0));
}
System.out.println(res); // => [\x12, \x13, \x14, \x00, \xff, \xff]
// Check if a string consists of valid "blocks" only
boolean isValid = s.matches("(?i)(?:\\\\x[a-f0-9]{2})+");
System.out.println(isValid); // => true
Note that we may shorten [a-zA-Z] to [a-z] if we add a case insensitive modifier (?i) to the start of the pattern, or just use \p{Alnum} that matches any alphanumeric char in a Java regex.
The String#matches method always anchors the regex by default, we do not need the leading ^ and trailing $ anchors when using the pattern inside it.

Extract number not in brackets from this string using regular expressions [70-(90)]

[15-]
[41-(32)]
[48-(45)]
[70-15]
[40-(64)]
[(128)-42]
[(128)-56]
I have these values for which I want to extract the value not in curled brackets. If there is more than one, then add them together.
What is the regular expression to do this?
So the solution would look like this:
[15-] -> 15
[41-(32)] -> 41
[48-(45)] -> 48
[70-15] -> 85
[40-(64)] -> 40
[(128)-42] -> 42
[(128)-56] -> 56
You would be over complicating if you go for a regex approach (in this case, at least), also, regular expressions does not support mathematical operations, as pointed out by #richardtallent.
You can use an approach as shown here to extract a substring which omits the initial and final square brackets, and then, use the Split (as shown here) and split the string in two using the dash sign. Lastly, use the Instr function (as shown here) to see if any of the substrings that the split yielded contains a bracket.
If any of the substrings contain a bracket, then, they are omitted from the addition, or they are added up if otherwise.
Regular expressions does not support performing math on the terms. You can loop through the groups that are matched and perform the math outside of Regex.
Here's the pattern to extract any number within the square brackets that are not in cury brackets:
\[
(?:(?:\d+|\([^\)]*\))-)*
(\d+)
(?:-[^\]]*)*
\]
Each number will be returned in $1.
This works by looking for a number that is prefixed by any number of "words" separated by dashes, where the "words" are either numbers themselves or parenthesized strings, and followed by, optionally, a dash and some other stuff before hitting the end brace.
If VBA's RegEx doesn't support uncaptured groups (?:), remove all of the ?:'s and your captured numbers will be in $3 instead.
A simpler pattern also works:
\[
(?:[^\]]*-)*
(\d+)
(?:-[^\]]*)*
\]
This simply looks for numbers delimited by dashes and allowing for the number to be at the beginning or end.
Private Sub regEx()
Dim RegexObj As New VBScript_RegExp_55.RegExp
RegexObj.Pattern = "\[(\(?[0-9]*?\)?)-(\(?[0-9]*?\)?)\]"
Dim str As String
str = "[15-]"
Dim Match As Object
Set Match = RegexObj.Execute(str)
Dim result As Integer
Dim value1 As Integer
Dim value2 As Integer
If Not InStr(1, Match.Item(0).submatches.Item(0), "(", 1) Then
value1 = Match.Item(0).submatches.Item(0)
End If
If Not InStr(1, Match.Item(0).submatches.Item(1), "(", 1) And Not Match.Item(0).submatches.Item(1) = "" Then
value2 = Match.Item(0).submatches.Item(1)
End If
result = value1 + value2
MsgBox (result)
End Sub
Fill [15-] with the other strings.
Ok! It's been 6 years and 6 months since the question was posted. Still, for anyone looking for something like that maybe now or in the future...
Step 1:
Trim Leading and Trailing Spaces, if any
Step 2:
Find/Search:
\]|\[|\(.*\)
Replace With:
<Leave this field Empty>
Step 3:
Trim Leading and Trailing Spaces, if any
Step 4:
Find/Search:
^-|-$
Replace With:
<Leave this field Empty>
Step 5:
Find/Search:
-
Replace With:
\+