apache-spark regex extract words from rdd - regex

I am trying to extract words from a text file.
Text file:
"Line1 with words to extract"
"Line2 with words to extract"
"Line3 with words to extract"
The following works well:
val data = sc.textFile(file_in).map(_.toLowerCase).cache()
val all = data.flatMap(a => "[a-zA-Z]+".r findAllIn a)
scala> data.count
res14: Long = 3
scala> all.count
res11: Long = 1419
But I want to extract the words separately for every line.
If I type
val separated = data.map(line => line.flatMap(a => "[a-zA-Z]+".r findAllIn a))
I get
scala> val separated = data.map(line => line.flatMap(a => "[a-zA-Z]+".r findAllIn a))
<console>:17: error: type mismatch;
found : Char
required: CharSequence
val separated = data.map(line => line.flatMap(a => "[a-zA-Z]+".r findAllIn a))
What am I doing wrong?
Thanks in advance

Thank you for your answer.
The goal was to count, for each line, how many of its words occur in a positive or a negative word list.
This seems to work:
// load inputfile
val file_in = "/path/to/teststring.txt"
val data = sc.textFile(file_in).map(_.toLowerCase).cache()
// load wordlists
val pos_file = "/path/to/pos_list.txt"
val neg_file = "/path/to/neg_list.txt"
val pos_words = sc.textFile(pos_file).cache().collect().toSet
val neg_words = sc.textFile(neg_file).cache().collect().toSet
// RegEx
val regexpr = """[a-zA-Z]+""".r
val separated = data.map(line => regexpr.findAllIn(line).toList)
// per line: (number of words, number of positive words, number of negative words)
val counts = separated.map(list => (list.size, list.count(pos_words.contains), list.count(neg_words.contains)))
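The same counting logic can be tried without a Spark context. The following is only a minimal sketch on plain Scala collections; the sample lines and the two word sets are made-up stand-ins for the input file and the word lists, and the object name is hypothetical.

```scala
object WordListCounts {
  // Hypothetical stand-ins for the input file and the two word lists.
  val lines = List("Good day but bad weather", "nothing special here")
  val posWords = Set("good", "great")
  val negWords = Set("bad", "awful")

  val wordRegex = "[a-zA-Z]+".r

  // One List[String] per line, so the line structure is preserved.
  val separated: List[List[String]] =
    lines.map(line => wordRegex.findAllIn(line.toLowerCase).toList)

  // (total words, positive words, negative words) for each line.
  val counts: List[(Int, Int, Int)] =
    separated.map(ws => (ws.size, ws.count(posWords.contains), ws.count(negWords.contains)))

  def main(args: Array[String]): Unit = counts.foreach(println)
}
```

On an RDD the two map steps would look the same, since the word sets are ordinary local values captured by the closures.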

Your problem is not really specific to Apache Spark. The outer map hands you one line (a String) at a time, but calling flatMap on that String iterates over its characters, so the regex receives a Char where a CharSequence is required. Spark or not, your code will not compile; for example, in a Scala REPL:
> val lines = List("Line1 with words to extract",
"Line2 with words to extract",
"Line3 with words to extract")
> lines.map(line => line.flatMap("[a-zA-Z]+".r findAllIn _))
<console>:9: error: type mismatch;
found : Char
required: CharSequence
So if you want all the words from all the lines using your regex, just call flatMap once, on the collection of lines:
scala> lines.flatMap("[a-zA-Z]+".r findAllIn _)
res: List[String] = List(Line, with, words, to, extract, Line, with, words, to, extract, Line, with, words, to, extract)
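To keep the words grouped by line instead, map over the lines and materialise the matches of each line. The sketch below uses a plain List rather than an RDD; the object and value names are only illustrative.

```scala
object PerLineWords {
  val lines = List(
    "Line1 with words to extract",
    "Line2 with words to extract")

  val wordRegex = "[a-zA-Z]+".r

  // A String is a sequence of Char: this is what a nested flatMap on a line would iterate over.
  val firstChars: List[Char] = lines.head.toList.take(4)

  // map keeps one result per line; findAllIn returns an iterator, so convert it to a List.
  val perLine: List[List[String]] =
    lines.map(line => wordRegex.findAllIn(line).toList)

  // flatMap merges the words of all lines into one flat list.
  val flat: List[String] =
    lines.flatMap(line => wordRegex.findAllIn(line).toList)
}
```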
Regards,

Related

Scala, regex matching ignore unnecessary words

My program is:
val pattern = "[*]prefix_([a-zA-Z]*)_[*]".r
val outputFieldMod = "TRASHprefix_target_TRASH"
var tar =
outputFieldMod match {
case pattern(target) => target
}
println(tar)
Basically, I am trying to capture "target" and ignore "TRASH" (which is why I used *), but the code fails with an error and I am not sure why.
The simplest fix is a standard library function: unanchored.
Solution one
Call unanchored on the pattern so that it can match anywhere inside the string, ignoring the trash:
val pattern = "prefix_([a-zA-Z]*)_".r.unanchored
An unanchored regex only has to match part of the input, so all the surrounding trash (all the other words) is ignored:
val result = str match {
case pattern(value) => value
case _ => ""
}
Example
Scala REPL
scala> val pattern = """foo\((.*)\)""".r.unanchored
pattern: scala.util.matching.UnanchoredRegex = foo\((.*)\)
scala> val str = "blahblahfoo(bar)blahblah"
str: String = blahblahfoo(bar)blahblah
scala> str match { case pattern(value) => value ; case _ => "no match" }
res3: String = bar
Solution two
Pad your pattern on both sides with .*, which matches any sequence of characters other than line breaks:
val pattern = ".*prefix_([a-zA-Z]*)_.*".r
val result = str match {
case pattern(value) => value
case _ => ""
}
Example
Scala REPL
scala> val pattern = """.*foo\((.*)\).*""".r
pattern: scala.util.matching.Regex = .*foo\((.*)\).*
scala> val str = "blahblahfoo(bar)blahblah"
str: String = blahblahfoo(bar)blahblah
scala> str match { case pattern(value) => value ; case _ => "no match" }
res4: String = bar
This will also work: val pattern = ".*prefix_([a-z]+).*".r. However, it tells target from trash purely by lower-case versus upper-case letters. Whatever actually distinguishes real target data from trash data should determine the final regex pattern.
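For comparison, here is a small sketch that runs the original pattern and both suggested fixes against the string from the question. The helper extract and the object name are made up for this illustration. The original pattern fails because [*] is a character class containing a literal asterisk, not a wildcard.

```scala
import scala.util.matching.Regex

object ExtractTarget {
  val input = "TRASHprefix_target_TRASH"

  // Unanchored: the pattern may match anywhere inside the input.
  val unanchored: Regex = "prefix_([a-zA-Z]*)_".r.unanchored

  // Anchored (the default in pattern matching): the pattern must cover the whole input.
  val padded: Regex = ".*prefix_([a-zA-Z]*)_.*".r

  // The pattern from the question: [*] only matches a literal '*'.
  val original: Regex = "[*]prefix_([a-zA-Z]*)_[*]".r

  // Hypothetical helper: Some(captured group) on a match, None otherwise.
  def extract(regex: Regex, s: String): Option[String] = s match {
    case regex(target) => Some(target)
    case _             => None
  }

  val viaUnanchored: Option[String] = extract(unanchored, input)
  val viaPadded: Option[String] = extract(padded, input)
  val viaOriginal: Option[String] = extract(original, input)
}
```

Returning an Option avoids the MatchError that a match without a default case throws when the pattern does not apply.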

strange behaviour with filter?

I want to extract MIME-like headers (starting with [Cc]ontent- ) from a multiline string:
scala> val regex = "[Cc]ontent-".r
regex: scala.util.matching.Regex = [Cc]ontent-
scala> headerAndBody
res2: String =
"Content-Type:application/smil
Content-ID:0.smil
content-transfer-encoding:binary
<smil><head>
"
This fails
scala> headerAndBody.lines.filter(x => regex.pattern.matcher(x).matches).toList
res4: List[String] = List()
but the "related" cases work as expected:
scala> headerAndBody.lines.filter(x => regex.pattern.matcher("Content-").matches).toList
res5: List[String] = List(Content-Type:application/smil, Content-ID:0.smil, content-transfer-encoding:binary, <smil><head>)
and:
scala> headerAndBody.lines.filter(x => x.startsWith("Content-")).toList
res8: List[String] = List(Content-Type:application/smil, Content-ID:0.smil)
What am I doing wrong in
x => regex.pattern.matcher(x).matches
such that it returns an empty List?
The reason for the failure with the first line is that you use the java.util.regex.Matcher.matches() method that requires a full string match.
To fix that, use the Matcher.find() method that searches for the match anywhere inside the input string and use the "^[Cc]ontent-" regex (note that the ^ symbol will force the match to appear at the start of the string).
Note that this line of code does not work as you expect:
headerAndBody.lines.filter(x => regex.pattern.matcher("Content-").matches).toList
Here the matcher is run against the literal string "Content-" rather than against x, so the check is true for every line (that is why you get all the lines in the result).
See this IDEONE demo:
val headerAndBody = "Content-Type:application/smil\nContent-ID:0.smil\ncontent-transfer-encoding:binary\n<smil><head>"
val regex = "^[Cc]ontent-".r
val s1 = headerAndBody.lines.filter(x => regex.pattern.matcher(x).find()).toList
println(s1)
val s2 = headerAndBody.lines.filter(x => regex.pattern.matcher("Content-").matches).toList
print (s2)
Results (the first is the fix, and the second shows that your second line of code fails):
List(Content-Type:application/smil, Content-ID:0.smil, content-transfer-encoding:binary)
List(Content-Type:application/smil, Content-ID:0.smil, content-transfer-encoding:binary, <smil><head>)
Your regex has to match the whole line, not just its first substring:
val regex = "[Cc]ontent-.*".r
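The difference between the approaches can be checked side by side. This is a minimal sketch; it splits on "\n" instead of calling lines, and the object and value names are only illustrative.

```scala
object HeaderFilter {
  val headerAndBody =
    "Content-Type:application/smil\nContent-ID:0.smil\ncontent-transfer-encoding:binary\n<smil><head>"

  val lines: List[String] = headerAndBody.split("\n").toList

  val prefix = "^[Cc]ontent-".r
  val wholeLine = "[Cc]ontent-.*".r

  // matches() needs the whole line to match, so a prefix-only pattern selects nothing.
  val prefixWithMatches = lines.filter(l => prefix.pattern.matcher(l).matches())

  // find() succeeds as soon as the pattern occurs; ^ pins it to the start of the line.
  val prefixWithFind = lines.filter(l => prefix.pattern.matcher(l).find())

  // A pattern that covers the whole line works with matches().
  val wholeLineWithMatches = lines.filter(l => wholeLine.pattern.matcher(l).matches())

  // Scala-level alternative that avoids the Java Matcher API.
  val prefixWithFindFirst = lines.filter(l => prefix.findFirstIn(l).isDefined)
}
```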

How to filter out alphanumeric strings in Scala using regular expression

I want to filter out alphanumeric and numeric words from my file. I'm working on Spark-Shell. These are the contents of my file sparktest.txt:
This is 1 file not 54783. Would you l1ke this file to be Writt3n to
HDFS?
Defining the file for collection:
scala> val myLines = sc.textFile("sparktest.txt")
Splitting the lines into words and keeping only the words longer than 2 characters:
scala> val myWords = myLines.flatMap(x => x.split("\\W+")).filter(x => x.length >2)
Defining a regular expression to use. I only want strings that match "[A-Za-z]+":
scala> val regexpr = "[A-Za-z]+".r
Attempting to filter out the alphanumeric and numeric strings:
scala> val myOnlyWords = myWords.map(x => x).filter(x => regexpr(x).matches)
<console>:27: error: scala.util.matching.Regex does not take parameters
val myOnlyWords = myWords.map(x => x).filter(x => regexpr(x).matches)
This is where I'm stuck.
I want the result to look like this:
Array[String] = Array(This, file, not, Would, you, this, file, HDFS)
You can actually do this in one transformation and filter the split arrays within your flatMap:
val myWords = myLines.flatMap(x => x.split("\\W+").filter(x => x.matches("[A-Za-z]+") && x.length > 2))
When I run this in spark-shell, I see:
scala> val rdd1 = sc.parallelize(Array("This is 1 file not 54783. Would you l1ke this file to be Writt3n to HDFS?"))
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[11] at parallelize at <console>:21
scala> val myWords = rdd1.flatMap(x => x.split("\\W+").filter(x => x.matches("[A-Za-z]+") && x.length > 2))
myWords: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[12] at flatMap at <console>:23
scala> myWords.collect
...
res0: Array[String] = Array(This, file, not, Would, you, this, file, HDFS)
You can use either filter(x => regexpr.pattern.matcher(x).matches) or filter(_.matches("[A-Za-z]+")).
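Both filters can be tried on a plain List without Spark. The sketch below uses the sample text from the question; the object and value names are only illustrative.

```scala
object OnlyWords {
  val line = "This is 1 file not 54783. Would you l1ke this file to be Writt3n to HDFS?"
  val lettersOnly = "[A-Za-z]+".r

  // Split on runs of non-word characters and keep tokens longer than 2 characters.
  val words: List[String] = line.split("\\W+").toList.filter(_.length > 2)

  // String.matches requires the whole token to match the pattern.
  val viaString: List[String] = words.filter(_.matches("[A-Za-z]+"))

  // Same check through the precompiled Regex, which avoids recompiling the pattern per token.
  val viaRegex: List[String] = words.filter(w => lettersOnly.pattern.matcher(w).matches())
}
```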

Regex parse a list of comma separated number ranges, and capture them in individual groups

I found the following regex for matching comma separated numbers or ranges of numbers:
val reg = """^(\d+(-\d+)?)(,\s*(\d+(-\d+)?))*$""".r
While this does match valid strings, I only get one String out of it, instead of a list of strings, each corresponding to one of the separated entries. E.g.
reg.findAllIn("1-2, 3").map(s => s""""$s"""").toList
Gives
List("1-2, 3")
But I want:
List("1-2", "3")
The following comes closer:
val list = "1-2, 3" match {
case reg(groups @ _*) => groups
case _ => Nil
}
list.map(s => s""""$s"""")
But it contains all sorts of 'garbage':
List("1-2", "-2", ", 3", "3", "null")
With findAllIn you should not try to match the entire string: it returns each longest contiguous match it can find, and your anchored pattern matches the whole input in one go. What you need instead is just the part of your regex that describes a single entry:
val reg = """(\d+(-\d+)?)""".r
If you use this with findAllIn it will return what you need.
scala> val x = """(\d+(-\d+)?)""".r
x: scala.util.matching.Regex = (\d+(-\d+)?)
scala> x.findAllIn("1-2, 3").toList
res0: List[String] = List(1-2, 3)
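If the entries are needed as numbers rather than strings, findAllMatchIn gives access to the groups of each match. The following is a sketch under the assumption that a single number n should be read as the range (n, n); the function parse and the object name are made up for this example, and the code does not validate the overall input the way the original anchored regex does.

```scala
object RangeParser {
  // One entry: a number, optionally followed by "-" and a second number.
  val entry = """(\d+)(?:-(\d+))?""".r

  // Each match is one entry; group 2 is null when there is no upper bound.
  def parse(s: String): List[(Int, Int)] =
    entry.findAllMatchIn(s).map { m =>
      val lo = m.group(1).toInt
      val hi = Option(m.group(2)).map(_.toInt).getOrElse(lo)
      (lo, hi)
    }.toList

  val entries: List[String] = entry.findAllIn("1-2, 3").toList
  val ranges: List[(Int, Int)] = parse("1-2, 3")
}
```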

Scala extract from list based on condition

I have a list of words and would like to extract those whose length is between 5 and 10. I am using the following code, but it does not seem to work. Also, I can only use val, not var.
val sentence = args(0)
val words = sentence.split(" ")
val fullsort = words.sortBy(w => w.length -> w)
val med = fullsort.map(x => if(x.length>3 && x.length<11) x)
map keeps every element (an if without an else yields Unit for the rest); use filter to drop the words you do not want:
val sentence = args(0)
val words = sentence.split(" ")
val results = words.filter(word => word.length >= 5 && word.length <= 10)
Try this
val sentence = args(0)
val words = sentence.split(" ")
val fullsort = words.sortBy(w => w.length -> w)
val med = fullsort collect {case x:String if (x.length >= 5 && x.length <= 10) => x}
Another alternative is to let a regex do more of the work for you:
val wordLimitRE = "\\b\\w{5,10}\\b".r
val wordIterator = wordLimitRE.findAllMatchIn(sentence).map {_.toString}
This regex starts with a word boundary \b, followed by a length-limited run of word characters \w{lower,upper}, and ends with another word boundary \b.
The method findAllMatchIn returns an Iterator[Regex.Match] over the non-overlapping matches; the word boundaries ensure that only whole words of the right length match. The map {_.toString} then turns it into an Iterator[String].
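The filter-based and the regex-based approach can be compared on a fixed sentence. This is only a sketch; the sample sentence and the names are made up, since args(0) is not available outside the original program.

```scala
object WordLengths {
  // Hypothetical input standing in for args(0).
  val sentence = "tiny words alongside extraordinary vocabulary"

  // Keep only the words whose length is between 5 and 10 inclusive.
  val byFilter: List[String] =
    sentence.split(" ").toList.filter(w => w.length >= 5 && w.length <= 10)

  // Same selection with a regex: whole words made of 5 to 10 word characters.
  val byRegex: List[String] =
    "\\b\\w{5,10}\\b".r.findAllIn(sentence).toList
}
```

The two approaches agree here, but they can differ on punctuation: splitting on spaces keeps a trailing comma as part of the word, whereas \w does not match it.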