How do I pick file names with a specified pattern in Scala - regex

OTC_omega_20210302.csv
CH_delta_20210302.csv
MD_omega_20210310.csv
CD_delta_20210310.csv
val hdfsPath = "/development/staging/abcd-efgh"
val fs = org.apache.hadoop.fs.FileSystem.get(spark.sparkContext.hadoopConfiguration)
val files = fs.listStatus(new Path(s"${hdfsPath}")).filterNot(_.isDirectory).map(_.getPath)
val regX = "OTC_*[0-9].csv|CH_*[0-9].csv".stripMargin.r
val filteredFiles = files.filter(fName => regX.findFirstMatchIn(fName.getName).isDefined)
What regex do I need if I want any file name that starts with either OTC_ or CH_ and ends with YYYYMMDD.csv?
For the files above, I need these two results:
OTC_omega_20210302.csv
CH_delta_20210302.csv
Please help

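You can use either of these equivalent forms (a regular string literal needs the backslash doubled, a triple-quoted one does not):
val regX = "^(?:OTC|CH)_.*[0-9]{8}\\.csv$".r
val regX = """^(?:OTC|CH)_.*[0-9]{8}\.csv$""".r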
See the regex demo.
Details:
^ - start of string
(?:OTC|CH) - a non-capturing group matching either OTC or CH char sequences
_ - a _ char
.* - any zero or more chars other than line break chars, as many as possible
[0-9]{8} - eight digits
\. - a literal dot (note . matches any char other than a line break char, you must escape . to make it match a dot)
csv - a csv string
$ - end of string.
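A minimal sketch of the filter applied to plain file-name strings, so it runs without Hadoop or Spark. The object name FileNameFilter and the helper select are illustrative and not part of the original code; with HDFS you would apply the same regex to the result of getName, as in the question. Note that the pattern only checks for eight digits, not for a valid calendar date.

```scala
object FileNameFilter {
  // Starts with OTC_ or CH_, ends with eight digits followed by ".csv"
  private val regX = """^(?:OTC|CH)_.*[0-9]{8}\.csv$""".r

  // Hypothetical helper: keep only the names in which the pattern is found
  def select(names: Seq[String]): Seq[String] =
    names.filter(name => regX.findFirstIn(name).isDefined)

  def main(args: Array[String]): Unit = {
    val names = Seq(
      "OTC_omega_20210302.csv",
      "CH_delta_20210302.csv",
      "MD_omega_20210310.csv",
      "CD_delta_20210310.csv"
    )
    // Prints the OTC_ and CH_ names only
    select(names).foreach(println)
  }
}
```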

Related

How to match in a single/common Regex Group matching or based on a condition

I would like to extract a part of each of two different test strings:
/i/int/2021/11/18/019e1691-614c-4402-a8c1-d0239ad1ac45/,640-1_999899,480-1_999899,960-1_999899,1280-1_999899,1920-1_999899,.mp4.csmil/master.m3u8?set-segment-duration=responsive
and
/i/int/2021/11/25/,live_20211125_215206_sendeton_640x360-50p-1200kbit,live_20211125_215206_sendeton_480x270-50p-700kbit,live_20211125_215206_sendeton_960x540-50p-1600kbit,live_20211125_215206_sendeton_1280x720-50p-3200kbit,live_20211125_215206_sendeton_1920x1080-50p-5000kbit,.mp4.csmil/master.m3u8
with a single RegEx and in Group-1.
By using this RegEx ^.[i,na,fm,d]+\/(.+([,\/])?(\/|.+=.+,\/).+\/[,](live.([^,]).).+_)?.+(640).*$ I can get the second string to match the desired result int/2021/11/25/,live_20211125_215206_
but the first string does not match in Group 1. The expected extraction for the first string is int/2021/11/18/019e1691-614c-4402-a8c1-d0239ad1ac45
Any pointers on this are appreciated.
Thanks!
If you want both values in group 1, you can use:
^/(?:[id]|na|fm)/([^/\s]*/\d{4}/\d{2}/\d{2}/\S*?)(?:/,|[^_]+_)640(?:\D|$)
The pattern matches:
^ Start of string
/ Match literally
(?:[id]|na|fm) Match one of i d na fm
/ Match literally
( Capture group 1
[^/\s]*/ Match optional chars other than / or a whitespace char, then match /
\d{4}/\d{2}/\d{2}/ Match a date like pattern
\S*? Match optional non whitespace chars, as few as possible
) Close group 1
(?:/,|[^_]+_) Match either /, or 1+ chars other than _ and then match _
640 Match literally
(?:\D|$) Match either a non-digit or assert the end of the string
See a regex demo and a go demo.
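As a rough check of the pattern above on the JVM, here is a small Scala sketch that pulls group 1 out of both test strings. The object name Group1Extract and the helper extract are made up for illustration; the pattern itself is copied unchanged.

```scala
object Group1Extract {
  private val re =
    """^/(?:[id]|na|fm)/([^/\s]*/\d{4}/\d{2}/\d{2}/\S*?)(?:/,|[^_]+_)640(?:\D|$)""".r

  // Hypothetical helper: the text captured by group 1, or None when the pattern is not found
  def extract(url: String): Option[String] =
    re.findFirstMatchIn(url).map(_.group(1))

  def main(args: Array[String]): Unit = {
    val first = "/i/int/2021/11/18/019e1691-614c-4402-a8c1-d0239ad1ac45/,640-1_999899,480-1_999899,960-1_999899,1280-1_999899,1920-1_999899,.mp4.csmil/master.m3u8?set-segment-duration=responsive"
    val second = "/i/int/2021/11/25/,live_20211125_215206_sendeton_640x360-50p-1200kbit,live_20211125_215206_sendeton_480x270-50p-700kbit,live_20211125_215206_sendeton_960x540-50p-1600kbit,live_20211125_215206_sendeton_1280x720-50p-3200kbit,live_20211125_215206_sendeton_1920x1080-50p-5000kbit,.mp4.csmil/master.m3u8"
    println(extract(first))  // Some(int/2021/11/18/019e1691-614c-4402-a8c1-d0239ad1ac45)
    println(extract(second)) // Some(int/2021/11/25/,live_20211125_215206_)
  }
}
```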
We can't know all the rules for how the strings you are matching are constructed, but for just these two example strings:
package main
import (
	"fmt"
	"regexp"
)
func main() {
	// Group 1 captures the path from /i/int/<date>/ up to the separator that precedes 640
	var re = regexp.MustCompile(`(?m)(\/i/int/\d{4}/\d{2}/\d{2}/.*)(?:\/,|_[\w_]+)640`)
	var str = `
/i/int/2021/11/18/019e1691-614c-4402-a8c1-d0239ad1ac45/,640-1_999899,480-1_999899,960-1_999899,1280-1_999899,1920-1_999899,.mp4.csmil/master.m3u8?set-segment-duration=responsive
/i/int/2021/11/25/,live_20211125_215206_sendeton_640x360-50p-1200kbit,live_20211125_215206_sendeton_480x270-50p-700kbit,live_20211125_215206_sendeton_960x540-50p-1600kbit,live_20211125_215206_sendeton_1280x720-50p-3200kbit,live_20211125_215206_sendeton_1920x1080-50p-5000kbit,.mp4.csmil/master.m3u8`
	match := re.FindAllStringSubmatch(str, -1)
	for _, val := range match {
		fmt.Println(val[1])
	}
}

scala.MatchError when partitioning a string with an optional part into one or three parts

I am trying to pull out something like this:
params = {"path", "contentName"}
part of parametersStr below
#RequestMapping(value = "/breezeQuery", params = {"path", "contentName"}, method = RequestMethod.GET)
Why is this code giving me a scala.MatchError?:
val paramsPattern = """(.*)(?:params = \{.*})?(.*)""".r
val paramsPattern(left, paramsStr, right) = parametersStr
Also, the pattern like this may not occur in the string. So I also want to know if that is the case. Finally, I'm capturing everything to the left and right of the group so that I can concatenate them to remove the captured group from the string. It is optional, but I do want to capture it if it is present.
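The scala.MatchError occurs because your pattern has only two capturing groups (the params part sits in a non-capturing group), while paramsPattern(left, paramsStr, right) expects three.
I believe you want to partition the string into 3 or 2 parts (depending on the optional params = \{.*}).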
You may use
^(.*?)(?:(params\s*=\s*\{.*?})(.*))?$
See the regex demo. Details
^ - start of string
(.*?) - Group 1 (left): any zero or more chars other than line break chars, as few as possible
(?:(params\s*=\s*\{.*?})(.*))? - an optional non-capturing group, will be tried at least once:
(params\s*=\s*\{.*?}) - Group 2 (paramsStr): the params word, = enclosed with 0+ whitespaces, {, any zero or more chars other than line break chars, as few as possible, and then }
(.*) - Group 3: any zero or more chars other than line break chars, as many as possible
$ - end of string
See the Scala demo:
val parametersStr = """#RequestMapping(value = "/breezeQuery", params = {"path", "contentName"}, method = RequestMethod.GET)"""
val paramsPattern = """^(.*?)(?:(params\s*=\s*\{.*?})(.*))?$""".r
val paramsPattern(left, paramsStr, right) = parametersStr
println(s"Left: $left\nParam String: $paramsStr\nRight: $right")
Output:
Left: #RequestMapping(value = "/breezeQuery",
Param String: params = {"path", "contentName"}
Right: , method = RequestMethod.GET)
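The question also asks how to tell whether the optional part occurred, and how to remove it. A possible sketch, assuming the same pattern: when the optional group does not take part in the match, Scala binds groups 2 and 3 to null, so wrapping them in Option makes the absent case explicit. The names ParamsSplit, split and removeParams are invented for this example. Note that concatenating left and right leaves the separating commas in place.

```scala
object ParamsSplit {
  private val paramsPattern = """^(.*?)(?:(params\s*=\s*\{.*?})(.*))?$""".r

  // Splits into (left, optional params part, right).
  // Groups 2 and 3 are null when the optional part is absent, hence the Option wrappers.
  def split(s: String): Option[(String, Option[String], String)] = s match {
    case paramsPattern(left, params, right) =>
      Some((left, Option(params), Option(right).getOrElse("")))
    case _ => None // e.g. input with line breaks, which . does not match
  }

  // Drops the params part, if any, by joining what is left and right of it
  def removeParams(s: String): String = split(s) match {
    case Some((left, _, right)) => left + right
    case None                   => s
  }

  def main(args: Array[String]): Unit = {
    val parametersStr = """#RequestMapping(value = "/breezeQuery", params = {"path", "contentName"}, method = RequestMethod.GET)"""
    println(split(parametersStr).flatMap(_._2).isDefined) // true: the params part is present
    println(removeParams(parametersStr))
  }
}
```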

Better way to extract numbers from a string

I have been trying to change a string like this, {X=5, Y=9} to a string like this (5, 9), as it would be used as an on-screen coordinate.
I finally came up with this code:
Dim str As String = String.Empty
Dim regex As Regex = New Regex("\d+")
Dim m As Match = regex.Match("{X=9")
If m.Success Then str = m.Value
Dim s As Match = regex.Match("Y=5}")
If s.Success Then str = "(" & str & ", " & s.Value & ")"
MsgBox(str)
which does work, but surely there must be a better way to do this (I am not familiar with regex).
I have many to convert in my program, and doing it like above would be torturous.
You may use
Dim result As String = Regex.Replace(input, ".*?=(\d+).*?=(\d+).*", "($1, $2)")
The regex means
.*? - any 0+ chars other than newline chars as few as possible
= - an equals sign
(\d+) - Group 1: one or more digits
.*?= - any 0+ chars other than newline chars as few as possible and then a = char
(\d+) - Group 2: one or more digits
.* - any 0+ chars other than newline chars as many as possible
The $1 and $2 in the replacement pattern are replacement backreferences that point to the values stored in Group 1 and 2 memory buffer.
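For comparison, the same pattern and replacement can be tried on the JVM. This is only a Scala sketch of the idea, not the VB.NET code above: it relies on String.replaceAll, which uses the same $1 and $2 replacement syntax. The names CoordFormat and toCoordinate are made up for illustration.

```scala
object CoordFormat {
  // Same pattern and replacement as above; input that does not match is returned unchanged
  def toCoordinate(input: String): String =
    input.replaceAll(""".*?=(\d+).*?=(\d+).*""", "($1, $2)")

  def main(args: Array[String]): Unit =
    println(toCoordinate("{X=5, Y=9}")) // prints (5, 9)
}
```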

Regex expression: return all chars before a '/', but if there are 2 '/', return all before the last one

I have been trying to get a regex expression to return me the following in the following situations.
XX -> XX
XXX -> XXX
XX/XX -> XX
XX/XX/XX -> XX/XX
XXX/XXX/XX -> XXX/XXX
I tried the following regexes, but they do not work.
^[^/]+ => https://regex101.com/r/xvCbNB/1
=========
([A-Z])\w+ => https://regex101.com/r/xvCbNB/2
They are close but are not there.
Any Help would be appreciated.
You want to get all text from the start till the last occurrence of a specific character or till the end of string if the character is missing.
Use
^(?:.*(?=\/)|.+)
See the regex demo and the regex graph:
Details
^ - start of string
(?:.*(?=\/)|.+) - a non-capturing group that matches either of the two alternatives, and if the first one matches first the second won't be tried:
.*(?=\/) - any 0+ chars other than line break chars, as many as possible, up to but excluding the last /
| - or
.+ - any 1+ chars other than line break chars, as many as possible.
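To see the two alternatives at work, here is a short Scala sketch that runs the pattern over the inputs from the question. BeforeLastSlash and prefix are illustrative names only; the / needs no escaping in a Java-style regex, so it is written plainly.

```scala
object BeforeLastSlash {
  private val re = """^(?:.*(?=/)|.+)""".r

  // Hypothetical helper: text before the last '/', or the whole string when there is no '/'
  def prefix(s: String): Option[String] = re.findFirstIn(s)

  def main(args: Array[String]): Unit =
    Seq("XX", "XXX", "XX/XX", "XX/XX/XX", "XXX/XXX/XX").foreach { s =>
      val p = prefix(s).getOrElse("")
      println(s"$s -> $p")
    }
}
```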
It will be easier to use a replace here to match / followed by non-slash characters before end of line:
Search regex:
/[^/]*$
Replacement String:
""
Updated RegEx Demo 1
If you're looking for a regex match then use this regex:
^(.*?)(?:/[^/]*)?$
Updated RegEx Demo 2
Any special reason it has to be a regular expression? How about just splitting the string at the slashes, removing the last item and rejoining:
function removeItemAfterLastSlash(string) {
  const list = string.split(/\//);
  // No slash at all: return the input unchanged
  if (list.length == 1) {
    return string;
  }
  list.pop();
  return list.join("/");
}
Or look for the last slash and cut the string there:
function removeItemAfterLastSlash(string) {
  const index = string.lastIndexOf("/");
  if (index === -1) {
    return string;
  }
  // Strings have no splice method; slice returns the part before the last slash
  return string.slice(0, index);
}
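The same non-regex idea, sketched in Scala for readers working on the JVM. The name DropLastSegment is made up for this example; lastIndexOf and substring are the standard java.lang.String methods.

```scala
object DropLastSegment {
  // Cuts the string at the last '/', or returns it unchanged when there is none
  def dropLastSegment(s: String): String = {
    val index = s.lastIndexOf('/')
    if (index == -1) s else s.substring(0, index)
  }

  def main(args: Array[String]): Unit =
    Seq("XX", "XX/XX", "XX/XX/XX").foreach(s => println(dropLastSegment(s)))
}
```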

Remove the text before the second comma (',') - string replace pattern

How can I use a regex to remove all text up to and including the second line that contains only a comma (line 5 in the example)?
example :
,
abc,xyz,ggg,nrmr
cde,jjj,kkkk,iiii,tem,posting
234,mm/dd/yy
,
454654,output2,sample
45646,output1,non-sample
16546,225.02
ABC,2.98
expected :
454654,output2,sample
45646,output1,non-sample
16546,225.02
ABC,2.98
It seems you may use
val s = """,
abc,xyz,ggg,nrmr
cde,jjj,kkkk,iiii,tem,posting
234,mm/dd/yy
,
454654,output2,sample
45646,output1,non-sample
16546,225.02
ABC,2.98"""
val res = s.replaceFirst("(?sm)\\A(.*?^,$){2}", "").trim()
println(res)
// =>
// 454654,output2,sample
// 45646,output1,non-sample
// 16546,225.02
// ABC,2.98
See the Scala demo.
Pattern details:
(?sm) - s enables . to match any char in the string including newlines, and m makes ^ and $ match start/end of line respectively
\\A - the start of string (\A in the regex; the backslash is doubled inside the string literal)
(.*?^,$){2} - 2 occurrences of:
.*? - any 0+ chars as few as possible up to the leftmost
^,$ - line that only contains ,.