I have the following pattern that I could parse using pandas in Python, but struggle with translating the code into Scala.
grade string_column
85 (str:ann smith,14)(str:frank chase,15)
86 (str:john foo,15)(str:al more,14)
In python I used:
df.set_index('grade')['string_column']\
.str.extractall(r'\((str:[^,]+),(\d+)\)')\
.droplevel(1)
with the output:
grade 0 1
85 str:ann smith 14
85 str:frank chase 15
86 str:john foo 15
86 str:al more 14
In Scala I tried to duplicate the approach, but it's failing:
import scala.util.matching.Regex
val pattern = new Regex("((str:[^,]+),(\d+)\)")
val str = "(str:ann smith,14)(str:frank chase,15)"
println(pattern findAllIn(str)).mkString(","))
There are a few notes about the code:
There is an unmatched parenthesis for a group, but that one should be escaped
The backslashes should be double escaped
In the println you don't have to use all the parenthesis and the dot
findAllIn returns a MatchIterator, and looping those will expose a matched string. Joining those matched strings with a comma, will in this case give back the same string again.
For example
import scala.util.matching.Regex
val pattern = new Regex("\\((str:[^,]+),(\\d+)\\)")
val str = "(str:ann smith,14)(str:frank chase,15)"
println(pattern findAllIn str mkString ",")
Output
(str:ann smith,14),(str:frank chase,15)
But if you want to print out the group 1 and group 2 values, you can use findAllMatchIn that returns a collection of Regex Matches:
import scala.util.matching.Regex
val pattern = new Regex("\\((str:[^,]+),(\\d+)\\)")
val str = "(str:ann smith,14)(str:frank chase,15)"
pattern findAllMatchIn str foreach(m => {
println(m.group(1))
println(m.group(2))
}
)
Output
str:ann smith
14
str:frank chase
15
In Python, Series.str.extractall only returns captured substrings. In Scala, findAllIn returns the matched values if you do not query its matchData property that in its turn contains a subgroups property.
So, to get the captures only in Scala, you need to use
val pattern = """\((str:[^,()]+),(\d+)\)""".r
val str = "(str:ann smith,14)(str:frank chase,15)"
(pattern findAllIn str).matchData foreach {
m => println(m.subgroups.mkString(","))
}
Output:
str:ann smith,14
str:frank chase,15
See the Scala online demo.
Here, m.subgroups accesses all subgroups (captures) of each match (m).
Also, note you do not need to double backslashes in triple-quoted string literals. \((str:[^,()]+),(\d+)\) matches
\( - a ( char
(str:[^,()]+) - Group 1: str: and one or more chars other than ,, ( and )
, - a comma
(\d+) - Group 2: one or more digits
\) - a ) char.
If you just want to get all matches without captures, you can use
val pattern = """\((str:[^,]+),(\d+)\)""".r
println((pattern findAllIn str).matchData.mkString(","))
Output:
(str:ann smith,14),(str:frank chase,15)
See the online demo.
Given a string like abab/docId/example-doc1-2019-01-01, I want to use Regex to extract these values:
firstPart = example
fullString = example-doc1-2019-01-01
I have this:
import scala.util.matching.Regex
case class Read(theString: String) {
val stringFormat: Regex = """.*\/docId\/([A-Za-z0-9]+)-([A-Za-z0-9-]+)$""".r
val stringFormat(firstPart, fullString) = theString
}
But this separates it like this:
firstPart = example
fullString = doc1-2019-01-01
Is there a way to retain the fullString and do a regex on that to get the part before the first hyphen? I know I can do this using the String split method but is there a way do it using regex?
You may use
val stringFormat: Regex = ".*/docId/(([A-Za-z0-9])+-[A-Za-z0-9-]+)$".r
||_ Group 2 _| |
| |
|_________________ Group 1 __|
See the regex demo.
Note how capturing parentheses are re-arranged. Also, you need to swap the variables in the regex match call, see demo below (fullString should come before firstPart).
See Scala demo:
val theString = "abab/docId/example-doc1-2019-01-01"
val stringFormat = ".*/docId/(([A-Za-z0-9]+)-[A-Za-z0-9-]+)".r
val stringFormat(fullString, firstPart) = theString
println(s"firstPart: '$firstPart'\nfullString: '$fullString'")
Output:
firstPart: 'example'
fullString: 'example-doc1-2019-01-01'
If I am doing this it is working fine:
val string = "somestring;userid=someidT;otherstuffs"
var pattern = """[;?&]userid=([^;&]+)?(;|&|$)""".r
val result = pattern.findFirstMatchIn(string).get;
But I am getting an error when I am doing this
val string = "somestring;userid=someidT;otherstuffs"
val id_name = "userid"
var pattern = """[;?&]""" + id_name + """=([^;&]+)?(;|&|$)""".r
val result = pattern.findFirstMatchIn(string).get;
This is the error:
error: value findFirstMatchIn is not a member of String
You may use an interpolated string literal and use a bit simpler regex:
val string = "somestring;userid=someidT;otherstuffs"
val id_name = "userid"
var pattern = s"[;?&]${id_name}=([^;&]*)".r
val result = pattern.findFirstMatchIn(string).get.group(1)
println(result)
// => someidT
See the Scala demo.
The [;?&]$id_name=([^;&]*) pattern finds ;, ? or & and then userId (since ${id_name} is interpolated) and then = is matched and then any 0+ chars other than ; and & are captured into Group 1 that is returned.
NOTE: if you want to use a $ as an end of string anchor in the interpolated string literal use $$.
Also, remember to Regex.quote("pattern") if the variable may contain special regex operators like (, ), [, etc. See Scala: regex, escape string.
Add parenthesis around the string so that regex is made after the string has been constructed instead of the other way around:
var pattern = ("[;?&]" + id_name + "=([^;&]+)?(;|&|$)").r
// pattern: scala.util.matching.Regex = [;?&]userid=([^;&]+)?(;|&|$)
val result = pattern.findFirstMatchIn(string).get;
// result: scala.util.matching.Regex.Match = ;userid=someidT;
I have this text:
2|#Favo|Name||26.0000|50.10000|_GRE|||||City|Road||||
I want to capture anything between those special chars: ||
For example, I want to capture "Name" only or I want to capture "City"
I've spent many hours and all I came up with is this regex:
([^|].*[$|])\w+
Here are the required values:
How can I capture one of them?
Thank you.
You may split the string with | removing empty entries and also all those that are blank or consisting only of digits:
Dim strng As String = "2|#Favo|Name||26.0000|50.10000|_GRE|||||City|Road||||"
Dim reslt As List(Of String) = strng.Split(New String() {"|"}, StringSplitOptions.RemoveEmptyEntries).Where(
Function(m) m.All(AddressOf Char.IsDigit) = False And String.Equals(m.Trim(), String.Empty) = False).ToList()
Console.Write(String.Join(", ", reslt))
Output:
I tried to extract 'US' from the following string, but it returns null, any idea? Thanks
console.log('sample sample 1234 (US)'.match(/\(W{2}\)$/))
W matches a letter W. It should be \w{2} or better [A-Z]{2}. Capture it with (...) and access the 1st captured value:
console.log('sample sample 1234 (US)'.match(/\(([A-Z]{2})\)$/)[1]);
// Or, with error checking
let m = 'sample sample 1234 (US)'.match(/\(([A-Z]{2})\)$/);
let res = m ? m[1] : "";
console.log(res)
If you do not want to access the contents of the capturing group, you need to post-process the results of /\([A-Z]{2}\)$/ regex:
let m = 'sample sample 1234 (US)'.match(/\([A-Z]{2}\)$/);
let res = m[0].substring(1, m[0].length-1) || "";
console.log(res);