Scala extract from list based on condition

I have a list of words, and I would like to extract the words whose length is between 5 and 10. I am using the following code, but it doesn't seem to work. Also, I can use only val, not var.
val sentence = args(0)
val words = sentence.split(" ")
val fullsort = words.sortBy(w => w.length -> w)
val med = fullsort.map(x => if(x.length>3 && x.length<11) x)

val sentence = args(0)
val words = sentence.split(" ")
val results = words.filter(word => word.length >= 5 && word.length <= 10)

Try this
val sentence = args(0)
val words = sentence.split(" ")
val fullsort = words.sortBy(w => w.length -> w)
val med = fullsort collect {case x:String if (x.length >= 5 && x.length <= 10) => x}
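For what it's worth, the map in the question keeps one element for every input word (an if without an else has nothing useful to return for the rejected words), whereas filter and collect drop them. A minimal sketch of the filter version as a reusable function; the names LengthFilter and wordsBetween are made up for illustration:

```scala
object LengthFilter {
  // Hypothetical helper: keep only the words whose length lies in [min, max].
  def wordsBetween(sentence: String, min: Int, max: Int): List[String] =
    sentence
      .split(" ")
      .toList
      .filter(w => w.length >= min && w.length <= max)
}
```

For example, LengthFilter.wordsBetween("a tiny elephant remembers extraordinarily", 5, 10) should give List(elephant, remembers).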

Another alternative is to let a regex do more of the work for you:
val wordLimitRE = "\\b\\w{5,10}\\b".r
val wordIterator = wordLimitRE.findAllMatchIn(sentence).map {_.toString}
This regex starts with a word-boundary pattern \b, then a range-limited match for word characters \w{lower,upper}, and finally another word-boundary pattern \b.
The method findAllMatchIn returns an Iterator[Regex.Match] over the matches (they cannot overlap because of the word-boundary patterns). The map {_.toString} then turns it into an Iterator[String].
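A small self-contained sketch of the same idea, using findAllIn (which yields the matched strings directly) instead of findAllMatchIn plus toString. The object and method names are only illustrative. Note that \w also matches digits and underscores, so "word" here is meant in the regex sense.

```scala
object RegexWordLimit {
  // Same pattern as above: a run of 5 to 10 word characters
  // delimited by word boundaries on both sides.
  private val wordLimitRE = "\\b\\w{5,10}\\b".r

  // Hypothetical helper: all words of length 5..10, in order of appearance.
  def extract(sentence: String): List[String] =
    wordLimitRE.findAllIn(sentence).toList
}
```

Longer words are skipped entirely rather than truncated, because there is no word boundary inside them for the trailing \b to match.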

Regular expression for identifying key value pair

I need to create a regular expression to identify key-value pairs that are separated by semicolons. Keys and values can contain letters (both upper and lower case), digits, and special characters (- _ , / .). The following regular expression works when the special characters follow the alphanumeric characters, but not when a special character comes before them. For example, "key1.=value1-;key2/=value2-" works, but ".key1=value1-;key2/=value2-" does not.
import scala.util.matching.Regex
object TestReges {
def main(args: Array[String]): Unit = {
//val inputPattern : String= "1one-=1;two=2"
//val inputPattern : String = "-"
//val inputPattern : String= "two"
val inputPattern : String= "key1-=value1;key2=,value2."
val tagValidator: Regex = "(?:(\\w*\\d*-*_*,*/*\\.*)=(\\w*\\d*-*_*,*/*\\.*)(?=;|$))".r
//Pattern p = Pattern.compile("(?:(^[a-z]+)=(^[a-z]+)(?=&|$))");
//Matcher m = p.matcher(input);
//System.out.println(m.groupCount());
println(tagValidator.findAllMatchIn(inputPattern).size)
// while (m.find()) {
// System.out.println("key="+m.group(1));
// System.out.println("value="+m.group(2));
// }
} }
Why not split on the separator and then just match everything on either side of the = that isn't an =?
//Scala code
val kvPair = "([^=]+)=([^=]+)".r
List("key1.=value1-;key2/=value2-",".key1=value1-;key2/=value2-")
.flatMap(_.split(";"))
.collect{case kvPair(k,v) => k -> v}
//res0: List[(String, String)] = List((key1.,value1-)
// , (key2/,value2-)
// , (.key1,value1-)
// , (key2/,value2-))
Why do you want to do it with a regex? As @shmosel suggested, why can't you just split on ; and then on =? Like this:
val inputPattern: String = "key1-=value1;key2=,value2."
inputPattern.split(";").map { kvStr =>
  kvStr.split("=") match {
    case Array(k, v) => (k, v)
  }
}.toMap
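One caveat with the pattern match above: an entry that does not split into exactly two parts (for example a bare "two") raises a MatchError. A hedged sketch that skips such entries instead; KeyValueParser and parse are hypothetical names, and silently dropping malformed entries is my assumption about the desired behaviour:

```scala
object KeyValueParser {
  // Split on ';' into entries, then on the first '=' only (limit 2),
  // so a value that itself contains '=' stays in one piece.
  def parse(input: String): Map[String, String] =
    input
      .split(";")
      .toList
      .flatMap { entry =>
        entry.split("=", 2) match {
          case Array(k, v) if k.nonEmpty && v.nonEmpty => Some(k -> v)
          case _                                       => None // malformed: skip
        }
      }
      .toMap
}
```

With this, KeyValueParser.parse(".key1=value1-;key2/=value2-") should yield both pairs regardless of where the special characters sit, and inputs such as "two" or "-" give an empty map.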

How to group similar characters in a string in scala?

Let's assume I have a string such as:
val a = "aaaabbbcccss"
and I want to group only the a's and b's as such:
"a4b3cccss"
I have tried a.toList.groupBy(identity).mapValues(_.size), but that returns a map with no ordering, so I cannot convert it into the form I want. Is there a function in Scala that can achieve this?
You may use
val a = "aaaabbbcccss"
val p = """([ab])\1*""".r
println(p replaceAllIn (a, m => s"${m.group(1)}${m.group(0).size}") )
See Scala demo
The regex matches:
([ab]) - Group 1: a or b
\1* - zero or more occurrences of the char captured into Group 1.
In the replacement part, m.group(1) is the char captured into Group 1 and m.group(0).size is the size of the whole match.
As an alternative, you could write a function that takes your string and a list of characters and works recursively, taking each run of consecutive identical characters from the front of the list with takeWhile.
Then drop that run from the list (using the length of the takeWhile result) and pass along the piece to append, either the character plus its count or the run unchanged. The accumulated string is returned once the list is empty.
def countSimilar(str: String, ch: List[Char]): String = {
def process(l: List[Char], acc: String = ""): String = {
l match {
case Nil => acc
case h :: _ =>
val tw = l.takeWhile(_ == h)
acc + process(
l.drop(tw.length),
if (ch.contains(h)) h + tw.length.toString else tw.mkString("")
)
}
}
process(str.toList)
}
println(countSimilar("aaaabbbcccss", List('a', 'b')))
println(countSimilar("aaaabbbcccssaaaabb", List('a', 'b', 'c')))
That will give you:
a4b3cccss
a4b3c3ssa4b2
See the Scala demo
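The recursive helper above builds its result on the way back out of each call, so it is not tail-recursive and could overflow the stack on very long strings. A possible tail-recursive variant, sketched with span instead of takeWhile plus drop (RunLength is a made-up name; the behaviour is intended to match countSimilar above):

```scala
import scala.annotation.tailrec

object RunLength {
  // Collapse runs of the given characters into "<char><count>";
  // runs of any other character are copied through unchanged.
  def countSimilar(str: String, chars: Set[Char]): String = {
    @tailrec
    def loop(rest: List[Char], acc: String): String = rest match {
      case Nil => acc
      case h :: _ =>
        // span splits off the leading run of h in a single pass
        val (run, tail) = rest.span(_ == h)
        val piece = if (chars.contains(h)) s"$h${run.length}" else run.mkString
        loop(tail, acc + piece)
    }
    loop(str.toList, "")
  }
}
```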

How to split string by delimiter in scala?

I have a string like this:
val str = "3.2.1"
And I want to do some manipulations based on it.
I will also share what I want to do, and it would be nice if you could share your suggestions:
I'm doing automation for a website, and based on this string I need to perform some actions.
So:
the first digit - I will need to choose by value: value="str[0]"
the second digit - I will need to choose by value: value="str[0]+"."+str[1]"
the third digit - I will need to choose by value: value="str[0]+"."+str[1]+"."+str[2]"
As you can see, the second field I need to choose is named firstdigit.seconddigit, and the third is firstdigit.seconddigit.thirddigit.
You can use pattern matching for this.
First create regex:
@ val pattern = """(\d+)\.(\d+)\.(\d+)""".r
pattern: util.matching.Regex = (\d+)\.(\d+)\.(\d+)
Then you can use it to pattern match:
@ "3.4.342" match { case pattern(a, b, c) => println(a, b, c) }
(3,4,342)
If you don't need all the numbers, you can, for example, do this:
@ "1.2.0" match { case pattern(a, _, _) => println(a) }
1
If you want to take just the first two numbers, you can do:
@ val twoNumbers = "1.2.0" match { case pattern(a, b, _) => s"$a.$b" }
twoNumbers: String = "1.2"
I can only add one more variant to @Lukasz's answer, this one extracting the values directly:
@ val pattern = """(\d+)\.(\d+)\.(\d+)""".r
pattern: scala.util.matching.Regex = (\d+)\.(\d+)\.(\d+)
@ val pattern(firstdigit, seconddigit, thirddigit) = "3.2.1"
firstdigit: String = "3"
seconddigit: String = "2"
thirddigit: String = "1"
This way all the values can be treated as regular vals further in the code.
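Putting this together for the original question, the three selector values can be built from the captured groups. This is only a sketch: VersionPrefixes and prefixes are illustrative names, and returning None for input that does not have exactly three numeric parts is an assumption on my side.

```scala
object VersionPrefixes {
  private val pattern = """(\d+)\.(\d+)\.(\d+)""".r

  // For "3.2.1" this builds ("3", "3.2", "3.2.1").
  // A regex used as an extractor must match the whole string,
  // so anything else falls through to None.
  def prefixes(s: String): Option[(String, String, String)] = s match {
    case pattern(a, b, c) => Some((a, s"$a.$b", s"$a.$b.$c"))
    case _                => None
  }
}
```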
val str="vaquar.khan"
val strArray=str.split("\\.")
strArray.foreach(println)
Try the following (the dot must be escaped, because split takes a regular expression):
scala> "3.2.1".split("\\.")
res0: Array[java.lang.String] = Array(3, 2, 1)
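The thing to keep in mind is that String.split treats its argument as a regular expression, and an unescaped . matches any character, so every character becomes a delimiter and nothing is left. A few equivalent ways to split on a literal dot (the object and method names are illustrative):

```scala
import java.util.regex.Pattern

object SplitOnDot {
  // Unescaped "." is the regex "any character": every char is a delimiter,
  // only empty strings remain, and split drops trailing empty strings.
  def unescaped(s: String): List[String] = s.split(".").toList

  // Escape the dot by hand ...
  def escaped(s: String): List[String] = s.split("\\.").toList

  // ... or let Pattern.quote build a literal pattern ...
  def quoted(s: String): List[String] = s.split(Pattern.quote(".")).toList

  // ... or use Scala's Char overload, which involves no regex at all.
  def byChar(s: String): List[String] = s.split('.').toList
}
```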
This one:
object Splitter {
def splitAndAccumulate(string: String) = {
val s = string.split("\\.")
s.tail.scanLeft(s.head){ case (acc, elem) =>
acc + "." + elem
}
}
}
passes this test:
test("Simple"){
val t = Splitter.splitAndAccumulate("1.2.3")
val answers = Seq("1", "1.2", "1.2.3")
t.zip(answers).foreach{ case (l, r) =>
assert(l == r)
}
}

apache-spark regex extract words from rdd

I am trying to extract words from a text file.
Text file:
"Line1 with words to extract"
"Line2 with words to extract"
"Line3 with words to extract"
The following works well:
val data = sc.textFile(file_in).map(_.toLowerCase).cache()
val all = data.flatMap(a => "[a-zA-Z]+".r findAllIn a)
scala> data.count
res14: Long = 3
scala> all.count
res11: Long = 1419
But I want to extract the words for every line.
If I type
val separated = data.map(line => line.flatMap(a => "[a-zA-Z]+".r findAllIn a))
I get
scala> val separated = data.map(line => line.flatMap(a => "[a-zA-Z]+".r findAllIn a))
<console>:17: error: type mismatch;
found : Char
required: CharSequence
val separated = data.map(line => line.flatMap(a => "[a-zA-Z]+".r findAllIn a))
What am I doing wrong?
Thanks in advance
Thank you for your answer.
The goal was to count the occurrences of words from a positive/negative word list.
This seems to work:
// load inputfile
val file_in = "/path/to/teststring.txt"
val data = sc.textFile(file_in).map(_.toLowerCase).cache()
// load wordlists
val pos_file = "/path/to/pos_list.txt"
val neg_file = "/path/to/neg_list.txt"
val pos_words = sc.textFile(pos_file).cache().collect().toSet
val neg_words = sc.textFile(neg_file).cache().collect().toSet
// RegEx
val regexpr = """[a-zA-Z]+""".r
val separated = data.map(line => regexpr.findAllIn(line).toList)
// #_of_words - #_of_pos_words_ - #_of_neg_words
val counts = separated.map(list => (list.size,(list.filter(pos => pos_words contains pos)).size, (list.filter(neg => neg_words contains neg)).size))
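The counting step does not depend on Spark, so it can be checked on plain strings first. A sketch under that assumption; SentimentCounts and counts are hypothetical names, and the word sets in the usage example are invented:

```scala
object SentimentCounts {
  private val wordRE = "[a-zA-Z]+".r

  // Returns (number of words, number of positive words, number of negative words)
  // for a single line, mirroring the tuple built in the snippet above.
  def counts(line: String, pos: Set[String], neg: Set[String]): (Int, Int, Int) = {
    val words = wordRE.findAllIn(line.toLowerCase).toList
    (words.size, words.count(pos.contains), words.count(neg.contains))
  }
}
```

For instance, SentimentCounts.counts("Good food but bad service", Set("good"), Set("bad")) should come out as (5,1,1).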
Your problem is not really specific to Apache Spark: your first map hands you a line, but calling flatMap on that line (a String) iterates over its characters. So, Spark or not, your code won't work. For example, in a Scala REPL:
> val lines = List("Line1 with words to extract",
"Line2 with words to extract",
"Line3 with words to extract")
> lines.map(line => line.flatMap("[a-zA-Z]+".r findAllIn _))
<console>:9: error: type mismatch;
found : Char
required: CharSequence
So if you want all the words from all the lines using your regexp, just use flatMap once:
scala> lines.flatMap("[a-zA-Z]+".r findAllIn _)
res: List[String] = List(Line, with, words, to, extract, Line, with, words, to, extract, Line, with, words, to, extract)
Regards,
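If the goal is to keep the words grouped by line (as the question asks), map each line to its own list rather than flattening. A plain-collections sketch; the lambda inside map is the part that would go into the RDD's map, and the object and method names are illustrative:

```scala
object WordsPerLine {
  private val wordRE = "[a-zA-Z]+".r

  // One inner list per input line: map over lines, findAllIn within each line.
  def grouped(lines: List[String]): List[List[String]] =
    lines.map(line => wordRE.findAllIn(line).toList)

  // A single flat list of all words: flatMap over the lines.
  def flat(lines: List[String]): List[String] =
    lines.flatMap(line => wordRE.findAllIn(line))
}
```

The only difference between the two is map versus flatMap at the outer level; neither ever calls flatMap on the String itself.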

Extract numbers from string with rich string magic

I want to extract a list of IDs from a string with the following pattern:
{(2),(4),(5),(100)}
Note: no leading or trailing spaces.
The List can have up to 1000 IDs.
I want to use rich string pattern matching to do this, but after 20 frustrating minutes I have not found a way.
Could anyone help me to come up with the correct pattern? Much appreciated!
Here's brute force string manipulation.
scala> "{(2),(4),(5),(100)}".replaceAll("\\(", "").replaceAll("\\)", "").replaceAll("\\{","").replaceAll("\\}","").split(",")
res0: Array[java.lang.String] = Array(2, 4, 5, 100)
Here's a regex, as @pst noted in the comments. If you don't want the parentheses, change the regular expression to """\d+""".r.
scala> val num = """\(\d+\)""".r
scala> num findAllIn "{(2),(4),(5),(100)}"
res33: scala.util.matching.Regex.MatchIterator = non-empty iterator
scala> res33.toList
res34: List[String] = List((2), (4), (5), (100))
"{(2),(4),(5),(100)}".split ("[^0-9]").filter(_.length > 0).map (_.toInt)
This splits wherever a character is not a digit and converts only the non-empty results.
The character class could be modified to allow dots or minus signs.
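As a sketch of that modification, matching the numbers directly (rather than splitting on everything else) makes an optional minus sign easy to allow. NumberExtractor and ints are made-up names, and only integers are handled here:

```scala
object NumberExtractor {
  // An optional leading minus followed by one or more digits.
  private val intRE = """-?\d+""".r

  def ints(s: String): List[Int] =
    intRE.findAllIn(s).map(_.toInt).toList
}
```

NumberExtractor.ints("{(2),(4),(5),(100)}") should give List(2, 4, 5, 100), and negative IDs such as (-3) would come through as -3.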
Use an extractor object:
object MyList {
def apply(l: List[String]): String =
if (l != Nil) "{(" + l.mkString("),(") + ")}"
else "{}"
def unapply(str: String): Some[List[String]] =
Some(
if (str.indexOf("(") > 0)
str.substring(str.indexOf("(") + 1, str.lastIndexOf(")")) split
"\\p{Space}*\\)\\p{Space}*,\\p{Space}*\\(\\p{Space}*" toList
else Nil
)
}
// test
"{(1),(2)}" match { case MyList(l) => l }
// res23: List[String] = List(1, 2)