Regex: match and tokenize in Scala - regex

I am trying to extract certain patterns from the input string. These patterns are +, - , *, / , (, ), log , integer and float numbers.
Here's example for the needed behavior:
//input string
var str = "log6*(12+5)/2-34.2"
//wanted result
var rightResp = Array("log","6","*","(","12","+","5",")","/","2","-","34.2")
I have tried to do this for some time but I have to admit that regex is not my specialty. Next piece of code shows where I am stuck:
import scala.util.matching.Regex
var str = "log6*(12+5)/2-34.2"
val pattern = new Regex("(\\+|-|log|\\*|\\/|[0-9]*\.?[0-9]*)")
pattern.findAllIn(str).toArray
Result is not good cause there is no matching for brackets "(" and ")" and also numbers , both integer(6,12,5,2) and float(34.2) are messed up. Thanks for your help!

You can use
[+()*/-]|log|[0-9]*\\.?[0-9]+
See regex demo
The regex contains 3 alternatives joined with the help of | alternation operator.
[+()*/-] - matches a single literal character: +, (, ), *, /, - (note that the hyphen is not escaped as it is at the end of the character class)
log - a literal letter sequence log
[0-9]*\\.?[0-9]+ - a float number that accepts values like .05, 5.55 as it matches...
[0-9]* - 0 or more digits
\\.? - and optional (1 or 0) literal periods
[0-9]+ - 1 or more digitis.
Here is a Scala code sample:
import scala.util.matching.Regex
object Main extends App {
var str = "log6*(12+5)/2-34.2"
val pattern = new Regex("[+()*/-]|log|[0-9]*\\.?[0-9]+")
val res = pattern.findAllIn(str).toArray
println(res.deep.mkString(", "))
}
Result: log, 6, *, (, 12, +, 5, ), /, 2, -, 34.2

Related

How to match in a single/common Regex Group matching or based on a condition

I would like to extract two different test strings /i/int/2021/11/18/019e1691-614c-4402-a8c1-d0239ad1ac45/,640-1_999899,480-1_999899,960-1_999899,1280-1_999899,1920-1_999899,.mp4.csmil/master.m3u8?set-segment-duration=responsive
and
/i/int/2021/11/25/,live_20211125_215206_sendeton_640x360-50p-1200kbit,live_20211125_215206_sendeton_480x270-50p-700kbit,live_20211125_215206_sendeton_960x540-50p-1600kbit,live_20211125_215206_sendeton_1280x720-50p-3200kbit,live_20211125_215206_sendeton_1920x1080-50p-5000kbit,.mp4.csmil/master.m3u8
with a single RegEx and in Group-1.
By using this RegEx ^.[i,na,fm,d]+\/(.+([,\/])?(\/|.+=.+,\/).+\/[,](live.([^,]).).+_)?.+(640).*$ I can get the second string to match the desired result int/2021/11/25/,live_20211125_215206_
but the first string does not match in Group-1 and the missing expected test string 1 extraction is int/2021/11/18/019e1691-614c-4402-a8c1-d0239ad1ac45
Any pointers on this is appreciated.
Thanks!
If you want both values in group 1, you can use:
^/(?:[id]|na|fm)/([^/\s]*/\d{4}/\d{2}/\d{2}/\S*?)(?:/,|[^_]+_)640(?:\D|$)
The pattern matches:
^ Start of string
/ Match literally
(?:[id]|na|fm) Match one of i d na fm
/ Match literally
( Capture group 1
[^/\s]*/ Match any char except a / or a whitespace char, then match /
\d{4}/\d{2}/\d{2}/ Match a date like pattern
\S*? Match optional non whitespace chars, as few as possible
) Close group 1
(?:/,|[^_]+_) Match either /, or 1+ chars other than _ and then match _
640 Match literally
(?:\D|$) Match either a non digits or assert end of string
See a regex demo and a go demo.
We can't know all the rules of how the strings your are matching are constructed, but for just these two example strings provided:
package main
import (
"fmt"
"regexp"
)
func main() {
var re = regexp.MustCompile(`(?m)(\/i/int/\d{4}/\d{2}/\d{2}/.*)(?:\/,|_[\w_]+)640`)
var str = `
/i/int/2021/11/18/019e1691-614c-4402-a8c1-d0239ad1ac45/,640-1_999899,480-1_999899,960-1_999899,1280-1_999899,1920-1_999899,.mp4.csmil/master.m3u8?set-segment-duration=responsive
/i/int/2021/11/25/,live_20211125_215206_sendeton_640x360-50p-1200kbit,live_20211125_215206_sendeton_480x270-50p-700kbit,live_20211125_215206_sendeton_960x540-50p-1600kbit,live_20211125_215206_sendeton_1280x720-50p-3200kbit,live_20211125_215206_sendeton_1920x1080-50p-5000kbit,.mp4.csmil/master.m3u8`
match := re.FindAllStringSubmatch(str, -1)
for _, val := range match {
fmt.Println(val[1])
}
}

How do i pick file names with specified pattern in scala

OTC_omega_20210302.csv
CH_delta_20210302.csv
MD_omega_20210310.csv
CD_delta_20210310.csv
val hdfsPath = "/development/staging/abcd-efgh"
val fs = org.apache.hadoop.fs.FileSystem.get(spark.sparkContext.hadoopConfiguration)
val files = fs.listStatus(new Path(s"${hdfsPath}")).filterNot(_.isDirectory).map(_.getPath)
val regX = "OTC_*[0-9].csv|CH_*[0-9].csv".stripMargin.r
val filteredFiles = files.filter(fName => regX.findFirstMatchIn(fName.getName).isDefined)
What is regex do i need to give if i need any file name that starts with either (OTC_ or CH_ ) and ends with YYYYMMDD.csv ?
As per the above files i need two outputs
OTC_omega_20210302.csv
CH_delta_20210302.csv
Please help
You can use
val regX = "^(?:OTC|CH)_.*[0-9]{8}\\.csv$".r
val regX = """^(?:OTC|CH)_.*[0-9]{8}\.csv$""".r
See the regex demo.
Details:
^ - start of string
(?:OTC|CH) - a non-capturing group matching either OTC or CH char sequences
_ - a _ char
.* - any zero or more chars other than line break chars, as many as possible
[0-9]{8} - eight digits
\. - a literal dot (note . matches any char other than a line break char, you must escape . to make it match a dot)
csv - a csv string
$ - end of string.

Better way to extract numbers from a string

I have been trying to change a string like this, {X=5, Y=9} to a string like this (5, 9), as it would be used as an on-screen coordinate.
I finally came up with this code:
Dim str As String = String.Empty
Dim regex As Regex = New Regex("\d+")
Dim m As Match = regex.Match("{X=9")
If m.Success Then str = m.Value
Dim s As Match = regex.Match("Y=5}")
If s.Success Then str = "(" & str & ", " & s.Value & ")"
MsgBox(str)
which does work, but surely there must be a better way to do this (I not familiar with Regex).
I have many to convert in my program, and doing it like above would be torturous.
You may use
Dim result As String = Regex.Replace(input, ".*?=(\d+).*?=(\d+).*", "($1, $2)")
The regex means
.*? - any 0+ chars other than newline chars as few as possible
= - an equals sign
(\d+) - Group 1: one or more digits
.*?= - any 0+ chars other than newline chars as few as possible and then a = char
(\d+) - Group 2: one or more digits
.* - any 0+ chars other than newline chars as many as possible
The $1 and $2 in the replacement pattern are replacement backreferences that point to the values stored in Group 1 and 2 memory buffer.

How to find any non-digit characters using RegEx in ABAP

I need a Regular Expression to check whether a value contains any other characters than digits between 0 and 9.
I also want to check the length of the value.
The RegEx I´ve made: ^([0-9]\d{6})$
My test value is: 123Z45 and 123456
The ABAP code:
FIND ALL OCCURENCES OF REGEX '^([0-9]\d{6})$' IN L_VALUE RESULTS DATA(LT_RESULTS).
I´m expecting a result in LT_RESULTS, when I´m testing the first test value '123Z45', because there is a non-digit character.
But LT_RESULTS is in nearly every test case empty.
Your expression ^([0-9]\d{6})$ translates to:
^ - start of input
( - begin capture group
[0-9] - a character between 0 and 9
\d{6} - six digits (digit = character between 0 and 9)
) - end capture group
$ - end of input
So it will only match 1234567 (7 digit strings), not 123456, or 123Z45.
If you just need to find a string that contains non digits you could use the following instead: ^\d*[^\d]+\d*$
* - previous element may occur zero, one or more times
[^\d] - ^ right after [ means "NOT", i.e. any character which is not a digit
+ - previous element may occur one or more times
Example:
const expression = /^\d*[^\d]+\d*$/;
const inputs = ['123Z45', '123456', 'abc', 'a21345', '1234f', '142345'];
console.log(inputs.filter(i => expression.test(i)));
You can also use this character class if you want to extract non-digit group:
DATA(l_guid) = '0074162D8EAA549794A4EF38D9553990680B89A1'.
DATA(regx) = '[[:alpha:]]+'.
DATA(substr) = match( val = l_guid
regex = regx
occ = 1 ).
It finds a first occured non-digit group of characters and shows it.
If you want to just check if they are exists or how much of them reside in your string, count built-in function is your friend:
DATA(how_many) = count( val = l_guid regex = regx ).
DATA(yes) = boolc( count( val = l_guid regex = regx ) > 0 ).
Match and count exist since ABAP 7.50.
If you don't need a Regular Expression for something more complex, ABAP has some nice comparison operators CO (Contains Only), CA, NA etc for you. Something like:
IF L_VALUE CO '0123456789' AND STRLEN( L_VALUE ) = 6.

String Replacing in Regex

I am trying to replace text in string using regex. I accomplished it in c# using the same pattern but in swift its not working as per needed.
Here is my code:
var pattern = "\\d(\\()*[x]"
let oldString = "2x + 3 + x2 +2(x)"
let newString = oldString.stringByReplacingOccurrencesOfString(pattern, withString:"*" as String, options:NSStringCompareOptions.RegularExpressionSearch, range:nil)
print(newString)
What I want after replacement is :
"2*x + 3 +x2 + 2*(x)"
What I am getting is :
"* + 3 + x2 +*)"
Try this:
(?<=\d)(?=x)|(?<=\d)(?=\()
This pattern matches not any characters in the given string, but zero width positions in between characters.
For example, (?<=\d)(?=x) This matches a position in between a digit and 'x'
(?<= is look behind assertion (?= is look ahead.
(?<=\d)(?=\() This matches the position between a digit and '('
So the pattern before escaping:
(?<=\d)(?=x)|(?<=\d)(?=\()
Pattern, after escaping the parentheses and '\'
\(?<=\\d\)\(?=x\)|\(?<=\\d\)\(?=\\\(\)