Unable to identify comma "," with regex using scala 2.10.3 - regex

I am currently writing a function with uses the UNIX ls -m command to list a bunch of files, and then transform them into a list using a regex.
My function is as follows:
def genFileList(path : String = "~") : Iterator[String] = {
val fileSeparatorRegex: Regex = "(.*),".r
val fullCommand : String = s"ls -m $path"
val rawFileList: String = fullCommand.!!
val files: Iterator[String] = fileSeparatorRegex.findAllIn(rawFileList).matchData.map(_.group(1))
var debug : List[String] = files.toList
debug
files
}
For example: let's assume I have a folder called test with 3 files: test.txt test1.txt test2.txt. The resulting list is:
Very strange...
Lets change the function to:
def genFileList(path : String = "~") : Iterator[String] = {
val fileSeparatorRegex: Regex = "(.*)\\n".r \\ Changed to match newline
val fullCommand : String = s"ls -1 $path" \\ Changed to give file name separated via newline
val rawFileList: String = fullCommand.!!
val files: Iterator[String] = fileSeparatorRegex.findAllIn(rawFileList).matchData.map(_.group(1))
var debug : List[String] = files.toList
debug
files
}
Tadaaaa:
Can anybody help me make sense of the first case failing?
Why do the commas generated by ls -m not get matched?

(.*) is a greedy pattern, it tries to match as much as it can, including the commas
test1.txt, test2.txt, test3.txt
^------------------^^
all of this is |
matched by .* this is matched by ,
The last chunk is not matched, because it's not followed by a comma.
You can use non-greedy matching using .*?
Alternatively, you can to just do rawFileList.stripSuffix("\n").split(", ").toList
Also, "ls -m ~".!! doesn't work, splitting output on commas won't work if filenames contain commas, "s"ls -m $path".!! is asking for shell injection, and new File(path).list() is way better in all aspects.

I can see two problems with your initial approach. The first is that the * in your regex is greedy, which means it's sucking up as much as possible before reaching a comma, including other commas. If you change it to non-greedy by adding a ? (i.e. "(.*?),".r) it will only match up to the first comma.
The second problem is that there's no comma following the last file (naturally), so it won't be found by the regex. In your second approach you're getting all three files because there's a newline after each of them. If you want to stick with commas you'd be better off using split (e.g. rawFileList.split(",")).
You might also consider using the list or listFiles methods on java.io.File:
scala> val dir = new java.io.File(".")
f: java.io.File = .
scala> dir.list
res0: Array[String] = Array(test, test1.txt, test2.txt)

Related

How do I do regex substitutions with multiple capture groups?

I'm trying to allow users to filter strings of text using a glob pattern whose only control character is *. Under the hood, I figured the easiest thing to filter the list strings would be to use Js.Re.test[https://rescript-lang.org/docs/manual/latest/api/js/re#test_], and it is (easy).
Ignoring the * on the user filter string for now, what I'm having difficulty with is escaping all the RegEx control characters. Specifically, I don't know how to replace the capture groups within the input text to create a new string.
So far, I've got this, but it's not quite right:
let input = "test^ing?123[foo";
let escapeRegExCtrl = searchStr => {
let re = [%re("/([\\^\\[\\]\\.\\|\\\\\\?\\{\\}\\+][^\\^\\[\\]\\.\\|\\\\\\?\\{\\}\\+]*)/g")];
let break = ref(false);
while (!break.contents) {
switch (Js.Re.exec_ (re, searchStr)) {
| Some(result) => {
let match = Js.Re.captures(result)[0];
Js.log2("Matching: ", match)
}
| None => {
break := true;
}
}
}
};
search -> escapeRegExCtrl
If I disregard the "test" portion of the string being skipped, the above output will produce:
Matching: ^ing
Matching: ?123
Matching: [foo
With the above example, at the end of the day, what I'm trying to produce is this (with leading and following .*:
.*test\^ing\?123\[foo.*
But I'm unsure how to achieve creating a contiguous string from the matched capture groups.
(echo "test^ing?123[foo" | sed -r 's_([\^\?\[])_\\\1_g' would get the work done on the command line)
EDIT
Based on Chris Maurer's answer, there is a method in the JS library that does what I was looking for. A little digging exposed the ReasonML proxy for that method:
https://rescript-lang.org/docs/manual/latest/api/js/string#replacebyre
Let me see if I have this right; you want to implement a character matcher where everything is literal except *. Presumably the * is supposed to work like that in Windows dir commands, matching zero or more characters.
Furthermore, you want to implement it by passing a user-entered character string directly to a Regexp match function after suitably sanitizing it to only deal with the *.
If I have this right, then it sounds like you need to do two things to get the string ready for js.re.test:
Quote all the special regex characters, and
Turn all instances of * into .* or maybe .*?
Let's keep this simple and process the string in two steps, each one using Js.re.replace. So the list of special characters in regex are [^$.|?*+(). Suitably quoting these for replace:
str.replace(/[\[\\\^\$\.\|\?\+\(\)]/g, '\$&')
This is just all those special characters quoted. The $& in the replacement specifications says to insert whatever matched.
Then pass that result to a second replace for the * to .*? transformation.
str.replace(/*+/g, '.*?')

scala regex pattern giving matcherror

I have a string
val path = "/bigdatahdfs/datalake/raw/lum/creditriskreporting/iffcollateral/year=2017/month=05/approach=firb/basel=3/version_partition=8/vFirbtestCollateralBaselIIIData_201705_8_20170620.txt.gz"
the pattern
.*version_partition=(\d+)(.*)
is working as expected in regex101.com.
Requirement is to extract two strings. one is "8" (exactly after version_partition=)and another is "/vFirbtestCollateralBaselIIIData_201705_8_20170620.txt.gz"
In scala REPL the same pattern is giving scala.MatchError. I am new in using regular expressions. Not sure what I am doing wrong here. Please help me here.
scala code is
val P = """.*version_partition=(\d+)(.*)""".r
val P(ver,fileName) = path;
I have tried with /g and /m flag also. It didn't work.
Your code works : https://scalafiddle.io/sf/Xz1Y0Ze/0
You don't need /g and /m flag.
/g ==> Perform a global match (find all matches rather than stopping
after the first match)
/m ==> Perform multiline matching
code :
val path = "/bigdatahdfs/datalake/raw/lum/creditriskreporting/iffcollateral/year=2017/month=05/approach=firb/basel=3/version_partition=8/vFirbtestCollateralBaselIIIData_201705_8_20170620.txt.gz"
val P = """.*version_partition=(\d+)(.*)""".r
val P(ver,fileName) = path;
println(ver)
println(fileName)
Try it using a match like this:
val path = "/bigdatahdfs/datalake/raw/lum/creditriskreporting/iffcollateral/year=2017/month=05/approach=firb/basel=3/version_partition=8/vFirbtestCollateralBaselIIIData_201705_8_20170620.txt.gz"
val P = """.*version_partition=(\d+)(.*)""".r
path match {
case P(a,b) ⇒
println(a)
println(b)
}
Test
You accidentally added a white space at the end.
https://regex101.com/r/FLkZEu/2
The .* at the beginning of the regex is useless

Regular Expression Split

I have a string as mentioned below. I have been trying to split using regular expression and going through the forums, I found ([^|]+) which would match everything except (pipe) However I want to break this into two using regular expressions, but not been able to do this. So one expression would be (xyz) which would extract from GA till everything before the pipe character, the second would be (abc) which would extract anything after the first pipe.
GA1.2.1127630839.1468526914|3847EFF358ABEC90-01A39B0290BAC298
The first is ^[^|]+ and the second is [^|]+$.
The idea is to use your negated character class with anchors. ^ will match the string start and $ will matchthe string end.
These two patterns have no lookarounds and will work with almost any regex flavor.
Guessing at popular languages. :-)
Python:
'GA1.2.1127630839.1468526914|3847EFF358ABEC90-01A39B0290BAC298'.split('|')
JavaScript:
'GA1.2.1127630839.1468526914|3847EFF358ABEC90-01A39B0290BAC298'.split('|')
PHP:
explode('|', 'GA1.2.1127630839.1468526914|3847EFF358ABEC90-01A39B0290BAC298')
Go:
strings.Split("GA1.2.1127630839.1468526914|3847EFF358ABEC90-01A39B0290BAC298", "|")
Ruby:
'GA1.2.1127630839.1468526914|3847EFF358ABEC90-01A39B0290BAC298'.split('|')
EDIT
After clarification, I get what you're asking. Fiddling with regex101.com, I found that those two expressions should give you what you want:
^.*(?=\|) gets the first part, and
(?<=\|).* gets the second.
When you click on the link, you can see it in action.
PREVIOUS ANSWER
Many alternatives to regular expressions as #smarx's answer reveals.
But something along those lines should do it:
R
myString <- 'GA1.2.1127630839.1468526914|3847EFF358ABEC90-01A39B0290BAC298'
part1 <- sub(pattern = "(.*)\\|(.*)", x = myString, replacement = "\\1")
part2 <- sub(pattern = "(.*)\\|(.*)", x = myString, replacement = "\\2")
R requires doubling all backslashes, some other languages don't.
Python
import re
myString = 'GA1.2.1127630839.1468526914|3847EFF358ABEC90-01A39B0290BAC298'
part1 = re.sub(pattern="(.*)\|(.*)", repl = "\\1", string = myString)
part1 = re.sub(pattern="(.*)\|(.*)", repl = "\\2", string = myString)

Incorrect use of regex wildcards

This is not correct use of wildcards ? I'm attempting to match String that contains a date. I don't want to include the date in the returned String or the String value that prepends the matched String.
object FindText extends App{
val toFind = "find1"
val line = "this is find1 the line 1 \n 21/03/2015"
val find = (toFind+".*\\d{2}/\\d{2}/\\d{4}").r
println(find.findFirstIn(line))
}
Output should be : "find1 the line 1 \n "
but String is not found.
Dot does not match newline characters by default. You can set a DOTALL flag to make it happen (I have also added a "positive look-ahead - the (?=...) thingy - since you did not want the date to be included in the match": val find = (toFind+"""(?s).*(?=\d{2}/\d{2}/\d{4})""").r
(Note also, that in scala you do not need to escape special characters in strings, enclosed in a triple-quote pairs ... pretty neat).
The problem lies with the newline in the test string. A .* does not match newlines apparently. Replacing this with .*\\n?.* should fix it. One could also use a multiline flag in the regex such as:
val find = ("(?s)"+toFind+".*\\d{2}/\\d{2}/\\d{4}").r

Trying to match a string in the format of domain\username using Lua and then mask the pattern with '#'

I am trying to match a string in the format of domain\username using Lua and then mask the pattern with #.
So if the input is sample.com\admin; the output should be ######.###\#####;. The string can end with either a ;, ,, . or whitespace.
More examples:
sample.net\user1,hello -> ######.###\#####,hello
test.org\testuser. Next -> ####.###\########. Next
I tried ([a-zA-Z][a-zA-Z0-9.-]+)\.?([a-zA-Z0-9]+)\\([a-zA-Z0-9 ]+)\b which works perfectly with http://regexr.com/. But with Lua demo it doesn't. What is wrong with the pattern?
Below is the code I used to check in Lua:
test_text="I have the 123 name as domain.com\admin as 172.19.202.52 the credentials"
pattern="([a-zA-Z][a-zA-Z0-9.-]+).?([a-zA-Z0-9]+)\\([a-zA-Z0-9 ]+)\b"
res=string.match(test_text,pattern)
print (res)
It is printing nil.
Lua pattern isn't regular expression, that's why your regex doesn't work.
\b isn't supported, you can use the more powerful %f frontier pattern if needed.
In the string test_text, \ isn't escaped, so it's interpreted as \a.
. is a magic character in patterns, it needs to be escaped.
This code isn't exactly equivalent to your pattern, you can tweek it if needed:
test_text = "I have the 123 name as domain.com\\admin as 172.19.202.52 the credentials"
pattern = "(%a%w+)%.?(%w+)\\([%w]+)"
print(string.match(test_text,pattern))
Output: domain com admin
After fixing the pattern, the task of replacing them with # is easy, you might need string.sub or string.gsub.
Like already mentioned pure Lua does not have regex, only patterns.
Your regex however can be matched with the following code and pattern:
--[[
sample.net\user1,hello -> ######.###\#####,hello
test.org\testuser. Next -> ####.###\########. Next
]]
s1 = [[sample.net\user1,hello]]
s2 = [[test.org\testuser. Next]]
s3 = [[abc.domain.org\user1]]
function mask_domain(s)
s = s:gsub('(%a[%a%d%.%-]-)%.?([%a%d]+)\\([%a%d]+)([%;%,%.%s]?)',
function(a,b,c,d)
return ('#'):rep(#a)..'.'..('#'):rep(#b)..'\\'..('#'):rep(#c)..d
end)
return s
end
print(s1,'=>',mask_domain(s1))
print(s2,'=>',mask_domain(s2))
print(s3,'=>',mask_domain(s3))
The last example does not end with ; , . or whitespace. If it must follow this, then simply remove the final ? from pattern.
UPDATE: If in the domain (e.g. abc.domain.org) you need to also reveal any dots before that last one you can replace the above function with this one:
function mask_domain(s)
s = s:gsub('(%a[%a%d%.%-]-)%.?([%a%d]+)\\([%a%d]+)([%;%,%.%s]?)',
function(a,b,c,d)
a = a:gsub('[^%.]','#')
return a..'.'..('#'):rep(#b)..'\\'..('#'):rep(#c)..d
end)
return s
end