How to use regular expressions in Scala?

I am learning Scala and Spark and want to extract the numbers from a string.
For that I am using a regular expression, and I came across the odd-looking syntax for using regex patterns in Scala.
Here is my code:
val myString: String = "there would be some number here 34."
val pattern = """.* ([\d]+).*""".r
val pattern(numberString) = myString
val num = numberString.toInt
println(num)
The code works, but it seems a bit odd and not very readable.
Is there another way to do this in Scala, or another syntax I can use?

The pattern-matching approach you use to extract the number is rather expensive: since the pattern must match the whole string, you have to add .* on both ends of the regex, and that triggers a lot of backtracking. You also had to add a space so that the first, greedy .* does not consume all but the last digit; with the space in place, the group captures the whole run of digits.
If you are looking for the first match, use findFirstIn:
val myString: String = "there would be some number here 34."
val numberString = """\d+""".r.findFirstIn(myString)
val num = numberString.get.toInt
println(num) // => 34
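For comparison, the same "search for the first match instead of matching the whole string" idea can be sketched in Python. This is only an illustration, not part of the original answer; `first_number` is a hypothetical helper name. Unlike calling `.get` on the Option above, it returns None instead of throwing when the string contains no digits.

```python
import re
from typing import Optional


def first_number(text: str) -> Optional[int]:
    """Return the first run of digits in text as an int, or None if there is none."""
    # re.search scans for the first match, so no ".*" padding is needed.
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


print(first_number("there would be some number here 34."))  # 34
print(first_number("no digits here"))  # None
```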

Related

How to get all sub-strings of a specific format from a string

I have a large string and I want to get all sub-strings of format [[someword]] from it.
Meaning, get all words (list) which are wrapped in opening and closing square brackets.
One way to do this is to split the string on spaces and then filter the resulting list, but the problem is that sometimes [[someword]] does not appear as a separate word: it might have a comma, a space or a period right before or after it.
What is the best way to do this?
I will appreciate a solution in Scala but as this is more of a programming problem, I will convert your solution to Scala if it's in some other language I know e.g. Python.
This question is different from the marked duplicate because the regex needs to be able to accommodate characters other than English letters between the brackets.
You can use this (?<=\[{2})[^[\]]+(?=\]{2}) regex to match and extract all the words you need that are contained in double square brackets.
Here is a Python solution,
import re
s = 'some text [[someword]] some [[some other word]]other text '
print(re.findall(r'(?<=\[{2})[^[\]]+(?=\]{2})', s))
Prints,
['someword', 'some other word']
I have never worked in Scala, but here is a solution in Java. Since Scala runs on the JVM and can use Java's regex classes, this may help.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
String s = "some text [[someword]] some [[some other word]]other text ";
Pattern p = Pattern.compile("(?<=\\[{2})[^\\[\\]]+(?=\\]{2})");
Matcher m = p.matcher(s);
// print every match found between double square brackets
while (m.find()) {
    System.out.println(m.group());
}
Prints,
someword
some other word
Let me know if this is what you were looking for.
Scala solution:
val text = "[[someword1]] test [[someword2]] test 1231"
val pattern = "\\[\\[([\\p{L}\\p{N}]+)\\]\\]".r // match bracketed words (letters and digits, so "someword1" matches) and capture the content
val values = pattern
  .findAllIn(text)
  .matchData
  .map(_.group(1)) // get the 1st group
  .toList
println(values) // List(someword1, someword2)
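The capture-group approach can also be sketched in Python, here with a character class that accepts anything except square brackets, so non-English words and multi-word phrases are kept. This is a minimal sketch; `extract_bracketed` and the sample sentence are hypothetical and not taken from the answers above.

```python
import re
from typing import List

# Two literal "[" characters, a captured run of non-bracket characters, two literal "]".
BRACKETED = re.compile(r"\[\[([^\[\]]+)\]\]")


def extract_bracketed(text: str) -> List[str]:
    """Return the contents of every [[...]] occurrence in text."""
    # With exactly one capture group, findall returns the captured strings.
    return BRACKETED.findall(text)


print(extract_bracketed("see [[someword1]], then [[слово]] and [[some other word]]."))
```

Because the brackets themselves are matched outside the group, punctuation directly before or after [[...]] does not affect the result.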

How to find the whole word in kotlin using regex

I want to find a whole word in a string, but I don't know how to find all of its occurrences in Kotlin. The word I am looking for has the form [non-alpha]cba[non-alpha]. My example code is below.
val testLink3 = """cba#cba cba"""
val word = "cba"
val matcher = "\\b[^a-zA-Z]*(?i)$word[^a-zA-Z]*\\b".toRegex()
val ret = matcher.find(testLink3)?.groupValues
But the output of my code is "cba".
My expected value is a string array such as "{cba, cba, cba}".
How can I get this value in Kotlin?
You may use
val testLink3 = """cBa#Cba cbA123"""
val word = "cba"
val matcher = "(?i)(?<!\\p{L})$word(?!\\p{L})".toRegex()
println(matcher.findAll(testLink3).map{it.value}.toList() )
println(matcher.findAll(testLink3).count() )
// => [cBa, Cba, cbA]
// => 3
To obtain the list of matches, you need to run the findAll method on the regex object, map the results to their values and convert the sequence to a list:
.findAll(testLink3).map{it.value}.toList()
To count the matches, you may use
matcher.findAll(testLink3).count()
Pattern details:
(?i) - case insensitive modifier
(?<!\\p{L}) - a negative lookbehind that fails the match if, immediately to the left of the current location, there is a letter
$word - your word variable value
(?!\\p{L}) - a negative lookahead that fails the match if, immediately to the right of the current location, there is a letter.
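The same lookaround idea can be sketched in Python. One assumption to flag: Python's built-in re module does not support \p{L}, so the sketch substitutes [^\W\d_] (a word character that is neither a digit nor an underscore) as an approximation of "a letter". The helper name `find_whole_word` is hypothetical.

```python
import re
from typing import List

# Approximation of \p{L} for Python's re module: a word character
# that is neither a digit nor an underscore.
LETTER = r"[^\W\d_]"


def find_whole_word(word: str, text: str) -> List[str]:
    """Find case-insensitive occurrences of word that are not adjacent to a letter."""
    pattern = re.compile(
        rf"(?<!{LETTER}){re.escape(word)}(?!{LETTER})",  # no letter on either side
        re.IGNORECASE,
    )
    return pattern.findall(text)


print(find_whole_word("cba", "cBa#Cba cbA123"))  # ['cBa', 'Cba', 'cbA']
```

re.escape is used so that a search word containing regex metacharacters is treated literally.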

Regex on nested arrays?

I've got a string of text that looks like this:
...],null,null,
],
["Tuesday",["8AM–5:30PM"]
,null,null,"2018-09-25",1,[[8,0,17,30]
]
,0]
,["Wednesday",["8AM–5:30PM"]
,null,null,"2018-09-26",1,[[8,0,17,30]
]
,0]
,["Thursday",["8AM–5:30PM"]
,null,null,"2018-09-27",1,[[8,0,17,30]
],x,y,[[[.....
I know this ends with three consecutive left brackets.
I'm writing a regex to grab all the arrays from the first day to the end of the last day's array, but it returns too much.
val regEx = """[a-zA-Z]*(day)(?s)(.*)(\[\[\[\")""".r
I'm using the (?s)(.*) to capture the fact that there can be newlines between day arrays.
This is essentially grabbing everything from the text following the first day rather than stopping at the [[[.
How can I resolve this issue?
A Scala Regex used in a pattern match is anchored by default, but your text string doesn't end with the target [[[. There's more after that, so you want the regex unanchored.
You put the text day in a capture group, which seems rather pointless in that you're losing the part that identifies which day you're starting with.
Why put the closing [[[ in a capture group? I don't see its purpose.
Your regex pattern ends with a double quote, ", but that's not in the sample string, so this pattern won't match at all, even though you claim it's "grabbing everything ... rather than stopping at the [[[". You should make sure that the code you post fails in the way you describe.
The title of your question mentions "nested arrays" but there are no arrays, nested or otherwise. You have a String that you are trying to parse. Perhaps something like this:
val str = """Tuesday",["8AM–5:30PM"]
,null,null,"2018-09-25",1,[[8,0,17,30]
]
,0]
,["Wednesday",["8AM–5:30PM"]
,null,null,"2018-09-26",1,[[8,0,17,30]
]
,0]
,["Thursday",["8AM–5:30PM"]
,null,null,"2018-09-27",1,[[8,0,17,30]
],x,y,[[[....."""
val regEx = """([a-zA-Z]*day)(?s)(.*)\[\[\[""".r.unanchored
str match {
case regEx(a,b) => s"-->>$a$b<<--"
case _ => "nope"
}
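An unanchored search of this kind can be sketched in Python, where re.search plays a role similar to .unanchored. Two assumptions to flag: the sketch uses a lazy .*? (the answer above uses a greedy .*) so that the capture stops at the first [[[ if the text contains more than one, and `first_day_block` plus the shortened sample string are hypothetical.

```python
import re
from typing import Optional, Tuple

# re.DOTALL lets "." match newlines, like (?s) in the Scala pattern.
# ".*?" is lazy, so the capture ends at the first "[[[" that follows.
DAY_BLOCK = re.compile(r"([A-Za-z]*day)(.*?)\[\[\[", re.DOTALL)


def first_day_block(text: str) -> Optional[Tuple[str, str]]:
    """Return (day name, text up to the first '[[[') or None if nothing matches."""
    match = DAY_BLOCK.search(text)  # search is unanchored: it may start anywhere
    return (match.group(1), match.group(2)) if match else None


sample = '["Tuesday",["8AM"]\n,0]\n,["Thursday",["8AM"]\n],x,y,[[[1]]],[[[2'
day, body = first_day_block(sample)
print(day)  # Tuesday
print("[[[" in body)  # False
```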
Quoting the question: "I know this ends with three consecutive left brackets. I'm writing a regex to grab this, but having trouble getting too much returned"
If you just need to grab that [[[, it can be done as below:
val str = """Tuesday",["8AM–5:30PM"]
,null,null,"2018-09-25",1,[[8,0,17,30]
]
,0]
,["Wednesday",["8AM–5:30PM"]
,null,null,"2018-09-26",1,[[8,0,17,30]
]
,0]
,["Thursday",["8AM–5:30PM"]
,null,null,"2018-09-27",1,[[8,0,17,30]
],x,y,[[[....."""
scala> val regEx = """\[\[\[""".r
regEx: scala.util.matching.Regex = \[\[\[
scala> regEx.findFirstIn(str).get
res20: String = [[[
If there are more occurrences of [[[ in str, you can use regEx.findAllIn(str).toArray, which returns an Array("[[[", ...) with one element per occurrence:
scala> regEx.findAllIn(str).toArray
res22: Array[String] = Array([[[)

Parse string using regex

I need to come up with a regular expression to parse my input string. My input string is of the format:
[alphanumeric].[alpha][numeric].[alpha][alpha][alpha].[julian date: yyyyddd]
e.g.:
A.A2.ABC.2014071
3.M1.MMB.2014071
I need to take the substring starting at the 3rd character and was wondering what the easiest way to do it would be.
Desired result:
A2.ABC.2014071
M1.MMB.2014071
(?i) makes the match case-insensitive.
(?i)^[a-z\d]\.[a-z]\d\.[a-z]{3}\.\d{7}$
Here a-z means any letter from a to z, and \d means any digit from 0 to 9.
Now, if you want to remove the first section before the dot, use this regex and replace the match with $1 (or maybe \1, depending on the language):
(?i)^[a-z\d]\.([a-z]\d\.[a-z]{3}\.\d{7})$
Another option is to replace whatever the following pattern matches with an empty string:
(?i)^[a-z\d]\.
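A minimal Python sketch of the capture-group variant: it validates the whole input against the format and returns the part after the first segment. The helper name `strip_first_segment` is hypothetical, and fullmatch is used in place of the ^ and $ anchors.

```python
import re
from typing import Optional

# One alphanumeric, a dot, then the captured remainder:
# letter + digit, dot, three letters, dot, seven digits (yyyyddd).
RECORD = re.compile(r"[a-z\d]\.([a-z]\d\.[a-z]{3}\.\d{7})", re.IGNORECASE)


def strip_first_segment(value: str) -> Optional[str]:
    """Return value without its leading 'X.' segment, or None if the format is wrong."""
    match = RECORD.fullmatch(value)  # the whole string must match the format
    return match.group(1) if match else None


print(strip_first_segment("A.A2.ABC.2014071"))  # A2.ABC.2014071
print(strip_first_segment("3.M1.MMB.2014071"))  # M1.MMB.2014071
```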
If the input string is just the long form, then you want everything except the first two characters. You could arrange to substitute them with nothing:
s/^..//
Or you could arrange to capture everything except the first two characters:
/^..(.*)/
If the expression is part of a larger string, then the breakdown of the alphanumeric components becomes more important.
The details vary depending on the language that is hosting the regex. The notations written above could be Perl or PCRE (Perl Compatible Regular Expressions). Many other languages would accept these regexes too, but other languages would require tweaks.
Use this regex:
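\w\.[A-Z]\d\.[A-Z]{3}\.\d{7}
Use the above regex like this (the dots are escaped so that each one matches only a literal dot):
import java.util.regex.Matcher;
import java.util.regex.Pattern;
String[] in = { "A.A2.ABC.2014071", "3.M1.MMB.2014071" };
Pattern p = Pattern.compile("\\w\\.[A-Z]\\d\\.[A-Z]{3}\\.\\d{7}");
for (String s : in) {
    Matcher m = p.matcher(s);
    while (m.find()) {
        // drop the first two characters (the leading segment and its dot)
        System.out.println("Result: " + m.group().substring(2));
    }
}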
Live demo: http://ideone.com/tns9iY

Simple Regular Expression matching

I'm new to regular expressions and I'm trying to use RegExp on the GWT client side. I want to do simple * wildcard matching (say, if the user enters 006*, I want to match 006...). I'm having trouble writing this. What I have is:
String input = "006*";
input = input.replaceAll("\\*", "(" + "\\" + "\\" + "S\\*" + ")");
RegExp regExp = RegExp.compile(input);
It returns true for strings like BKLFD006* too. What am I doing wrong?
Put a ^ at the start of the regex you're generating.
The ^ character means to match at the start of the source string only.
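The effect of the leading ^ can be sketched in Python. This is an illustration under assumptions, not the GWT API: `wildcard_to_regex` and `wildcard_matches` are hypothetical helpers, and the sketch also escapes the literal parts of the input so that characters such as . are not treated as regex syntax.

```python
import re


def wildcard_to_regex(wildcard: str) -> "re.Pattern[str]":
    """Translate a '*' wildcard into a regex anchored at the start of the string."""
    # Escape each literal piece, then join the pieces with \S* (the '*' wildcard).
    literal_parts = [re.escape(part) for part in wildcard.split("*")]
    return re.compile("^" + r"\S*".join(literal_parts))


def wildcard_matches(wildcard: str, text: str) -> bool:
    """True if text starts with something the wildcard pattern accepts."""
    return wildcard_to_regex(wildcard).search(text) is not None


print(wildcard_matches("006*", "006123"))     # True
print(wildcard_matches("006*", "BKLFD006*"))  # False
```

Without the ^, the second call would succeed, because an unanchored search may start matching in the middle of the string.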
I think you are mixing two things here, namely replacement and matching.
Matching is used when you want to extract the part of the input string that matches a specific pattern. That seems to be what you want here. To get one or more digits that are followed by a star and not preceded by anything, you can use the following regex:
^[0-9]+(?=\*)
and here is a Java snippet:
String subjectString = "006*";
String ResultString = null;
Pattern regex = Pattern.compile("^[0-9]+(?=\\*)");
Matcher regexMatcher = regex.matcher(subjectString);
if (regexMatcher.find()) {
ResultString = regexMatcher.group();
}
Replacement, on the other hand, is used when you want to replace a recurring pattern in the input string with something else.
For example, if you want to replace the digits followed by a star with the same digits surrounded by parentheses, you can do it like this:
String input = "006*";
String result = input.replaceAll("^([0-9]+)\\*", "($1)");
Notice the use of $1 to reference the digits that were captured by the capture group ([0-9]+) in the regex pattern.
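The matching versus replacement distinction can be sketched in Python with the same two patterns. Note that Python writes the back-reference as \1 rather than $1; the function names are hypothetical.

```python
import re
from typing import Optional


def digits_before_star(text: str) -> Optional[str]:
    """Matching: extract leading digits that are followed by a literal '*'."""
    # (?=\*) is a lookahead, so the star is required but not part of the match.
    match = re.search(r"^[0-9]+(?=\*)", text)
    return match.group() if match else None


def parenthesize_digits(text: str) -> str:
    """Replacement: turn leading 'digits*' into '(digits)'."""
    # \1 refers to the digits captured by the first group.
    return re.sub(r"^([0-9]+)\*", r"(\1)", text)


print(digits_before_star("006*"))   # 006
print(parenthesize_digits("006*"))  # (006)
```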