How to use the RegexMatcher in SparkNLP - regex

I want to run Spark NLP in JupyterLab with a Scala kernel and use the RegexMatcher annotator. I saved the patterns in a file named patterns.txt in an S3 bucket and tried the implementation below:
import com.johnsnowlabs.nlp.util.io.ExternalResource
import com.johnsnowlabs.nlp.util.io.ReadAs.LINE_BY_LINE
val document = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val regexmatcher = new RegexMatcher().
setInputCols(Array("document")).
setOutputCol("match").
setStrategy("MATCH_ALL").
setRules(ExternalResource("s3://bucket_name/patterns.txt", LINE_BY_LINE, Map("format" -> "text", "delimiter" -> " ")))
val pipeline_regex = new Pipeline().setStages(Array(document, regexmatcher))
val regex_match = pipeline_regex.fit(dev_data)
regex_match.transform(dev_data).select('match).show(false)
However, this doesn't seem to work at all, and patterns.txt is not used. How can I fix it?

Unable to capture required string from text file using Groovy - Jmeter JSR223

I need to parse a text file, testresults.txt, capture the serial number, and write the captured serial number to a separate text file called serialno.txt using Groovy in a JMeter JSR223 PostProcessor.
The code below is not working: it never enters the while loop. Kindly help.
import java.util.regex.Pattern
import java.util.regex.Matcher
String filecontent = new File("C:/device/resources/testresults.txt").text
def regex = "SerialNumber\" value=\"(.+)\""
java.util.regex.Pattern p = java.util.regex.Pattern.compile(regex)
java.util.regex.Matcher m = p.matcher(filecontent)
File SN = new File("C:/device/resources/serialno.txt")
while(m.find()) {
    SN.write m.group(1)
}
If your code doesn't enter the loop, it means there are no matches, so you need to amend your regular expression. You can use a site such as Regex101 for experiments.
Given the following content of the testresults.txt file:
SerialNumber" value="foo"
SerialNumber" value="bar"
SerialNumber" value="baz"
your code works fine.
For the time being I can only suggest using the find operator (=~) to make your code more "groovy":
def source = new File('C:/device/resources/testresults.txt').text
def matches = (source =~ 'SerialNumber" value="(.+?)"')
matches.each { match ->
    new File('C:/device/resources/serialno.txt') << match[1] << System.getProperty('line.separator')
}
More information: Apache Groovy - Why and How You Should Use It
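For comparison, the same extraction can be sketched with plain java.util.regex, which is what Groovy's =~ operator builds on. This is a minimal sketch rather than part of the original answer: the class and method names (SerialNumberExtractor, extractSerials) and the sample input are hypothetical. It collects every match into a list first, so the results can be written out once; note that Groovy's File.write replaces the file contents on each call, so writing inside the loop would keep only the last match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SerialNumberExtractor {
    // The lazy quantifier (.+?) stops at the first closing quote after value=".
    private static final Pattern SERIAL =
            Pattern.compile("SerialNumber\" value=\"(.+?)\"");

    // Returns every captured serial number, in order of appearance.
    static List<String> extractSerials(String content) {
        List<String> serials = new ArrayList<>();
        Matcher m = SERIAL.matcher(content);
        while (m.find()) {
            serials.add(m.group(1));
        }
        return serials;
    }

    public static void main(String[] args) {
        // Hypothetical sample input, modelled on the lines shown above.
        String sample = String.join("\n",
                "SerialNumber\" value=\"foo\"",
                "SerialNumber\" value=\"bar\" other=\"x\"");
        System.out.println(extractSerials(sample)); // [foo, bar]
    }
}
```

With a greedy (.+) the second sample line would capture everything up to the last quote on that line, which is why the lazy form is used here.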

Regex from Python to Kotlin

I have a question about regular expressions (regex), and I am really new to this. I found a tutorial that uses a regex in Python to delete labels from the data by replacing them with an empty string.
This is the code from Python:
import re
def extract_identity(data, context):
    """Background Cloud Function to be triggered by Pub/Sub.
    Args:
        data (dict): The dictionary with data specific to this type of event.
        context (google.cloud.functions.Context): The Cloud Functions event
            metadata.
    """
    import base64
    import json
    import urllib.parse
    import urllib.request
    if 'data' in data:
        strjson = base64.b64decode(data['data']).decode('utf-8')
        text = json.loads(strjson)
        text = text['data']['results'][0]['description']
        lines = text.split("\n")
        res = []
        for line in lines:
            line = re.sub('gol. darah|nik|kewarganegaraan|nama|status perkawinan|berlaku hingga|alamat|agama|tempat/tgl lahir|jenis kelamin|gol darah|rt/rw|kel|desa|kecamatan', '', line, flags=re.IGNORECASE)
            line = line.replace(":","").strip()
            if line != "":
                res.append(line)
        p = {
            "province": res[0],
            "city": res[1],
            "id": res[2],
            "name": res[3],
            "birthdate": res[4],
        }
        print('Information extracted:{}'.format(p))
In the above function, information extraction is done by removing all e-KTP labels with regular expressions.
This is the sample of e-KTP:
And this is the result after scanning that e-KTP using the python code:
Information extracted:{'province': 'PROVINSI JAWA TIMUR', 'city': 'KABUPATEN BANYUWANGI', 'id': '351024300b730004', 'name': 'TUHAN', 'birthdate': 'BANYUWANGI, 30-06-1973'}
This is the full tutorial from the above code.
My question is: can we use Regex in Kotlin to remove the labels from the e-KTP result, as in the Python code? The logic I tried does not remove the e-KTP labels. My Kotlin code looks like this:
....
val lines = result.text.split("\n")
val res = mutableListOf<String>()
Log.e("TAG LIST STRING", lines.toString())
for (line in lines) {
    Log.e("TAG STRING", line)
    line.matches(Regex("gol. darah|nik|kewarganegaraan|nama|status perkawinan|berlaku hingga|alamat|agama|tempat/tgl lahir|jenis kelamin|gol darah|rt/rw|kel|desa|kecamatan"))
    line.replace(":","")
    if (line != "") {
        res.add(line)
    }
    Log.e("TAG RES", res.toString())
}
Log.e("TAG INSERT", res.toString())
tvProvinsi.text = res[0]
tvKota.text = res[1]
tvNIK.text = res[2]
tvNama.text = res[3]
tvTgl.text = res[4]
....
And this is the result of my code:
TAG LIST STRING: [PROVINSI JAWA BARAP, KABUPATEN TASIKMALAYA, NIK 320625XXXXXXXXXX, BRiEAFAUZEROMARA, Nama, TempatTgiLahir, Jenis keiamir, etc]
TAG INSERT: [PROVINSI JAWA BARAP, KABUPATEN TASIKMALAYA, NIK 320625XXXXXXXXXX, BRiEAFAUZEROMARA, Nama, TempatTgiLahir, Jenis keiamir, etc]
The labels are still there. Is it possible to remove them using Regex or something similar in Kotlin, as in Python?
The point is to use kotlin.text.replace with a Regex as the search argument. For example:
text = text.replace(Regex("""<REGEX_PATTERN_HERE>"""), "<REPLACEMENT_STRING_HERE>")
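You may use
val cleaned = line.replace(Regex("""(?i)gol\. darah|nik|kewarganegaraan|nama|status perkawinan|berlaku hingga|alamat|agama|tempat/tgl lahir|jenis kelamin|gol darah|rt/rw|kel|desa|kecamatan"""), "").replace(":", "").trim()
Strings are immutable in Kotlin and the for-loop variable cannot be reassigned, so keep the result in a new variable (here cleaned) and add that to res. Calling line.matches(...) or line.replace(...) without using the return value, as in the question, has no effect.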
Note that (?i) at the start of the pattern is a quick way to make the whole pattern case insensitive.
Also, when you need to match a literal . with a regex, you need to escape it. Since a backslash can be written in several ways and people often get it wrong, it is best to define regex patterns in raw string literals. In Kotlin, use triple-quoted string literals, i.e. """...""", where each \ is treated as a literal backslash that forms the regex escape.
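The whole clean-up loop can be sketched in plain Java, since Kotlin's Regex on the JVM is backed by the same java.util.regex engine. This is an illustrative sketch, not the tutorial's code: the names LabelStripper and stripLabels and the sample OCR text are hypothetical. It mirrors the Python steps: remove the labels case-insensitively, drop colons, trim, and keep only non-empty lines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class LabelStripper {
    // Same alternation as above. CASE_INSENSITIVE plays the role of (?i),
    // and the dot in "gol. darah" is escaped so it matches a literal dot.
    private static final Pattern LABELS = Pattern.compile(
            "gol\\. darah|nik|kewarganegaraan|nama|status perkawinan|berlaku hingga"
            + "|alamat|agama|tempat/tgl lahir|jenis kelamin|gol darah|rt/rw|kel|desa|kecamatan",
            Pattern.CASE_INSENSITIVE);

    static List<String> stripLabels(String text) {
        List<String> res = new ArrayList<>();
        for (String line : text.split("\n")) {
            // replaceAll returns a new string, so the result has to be kept.
            String cleaned = LABELS.matcher(line).replaceAll("")
                    .replace(":", "")
                    .trim();
            if (!cleaned.isEmpty()) {
                res.add(cleaned);
            }
        }
        return res;
    }

    public static void main(String[] args) {
        // Hypothetical OCR output, one field per line.
        String ocr = "PROVINSI JAWA TIMUR\nNIK : 3510243006730004\nNama : TUHAN\nAgama";
        System.out.println(stripLabels(ocr));
        // [PROVINSI JAWA TIMUR, 3510243006730004, TUHAN]
    }
}
```

Short alternatives such as kel or nama also match inside longer words, so adding word boundaries (\b) around the alternation may be worth considering if real values get truncated.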

Regex on io.Text RDD using scala

I have a problem. I need to extract some data from a file like this:
(3269,
<page>
<title>Anarchism</title>
<ns>0</ns>
<id>12</id>
<revision>...
)
(194712,
<page>
<title>AssistiveTechnology</title>
<ns>0</ns>
<id>23</id>..
) etc...
This file was generated using:
val conf = new Configuration
conf.set("textinputformat.record.delimiter", "</page>")
val rdd=sc.newAPIHadoopFile("sample.bz2", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
rdd.map{case (k,v) => (k.get(), new String(v.copyBytes()))}
I need to obtain the title content. I'm using a regex, but the output file remains empty. My code looks like this:
val xx = rdd.map(x => x._2).filter(x => x.matches(".*<title>([A-Za-z]+)<\\/title>.*"))
I also tried this:
".*<title>([A-Za-z]+)</title>.*"
And using this:
val reg = ".*<title>([\\w]+)</title>.*".r
val xx = rdd.map(x => x._2).filter(x => reg.pattern.matcher(x).matches)
I build the .jar using sbt and run it with spark-submit.
BTW, it works in spark-shell :S
I need your help, please. Thanks.
You could use built-in Scala support for XML. Something like
import scala.xml._
rdd.map(x => (XML.loadString(x._2) \ "title").text)
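Independently of the XML approach, one thing worth checking with the regex attempts is that each record spans several lines. In java.util.regex (which Scala's .r and String.matches build on), `.` does not match line terminators by default, so a whole-string match with `.*<title>...</title>.*` fails on a multi-line record unless DOTALL, i.e. `(?s)`, is enabled. The sketch below shows the difference and uses find() to pull out the title. The names TitleExtractor and extractTitle are hypothetical, and this is an illustration rather than a confirmed diagnosis of the spark-submit behaviour.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TitleExtractor {
    private static final Pattern TITLE = Pattern.compile("<title>([^<]+)</title>");

    // find() searches anywhere in the record, so no leading or trailing .* is needed.
    static Optional<String> extractTitle(String record) {
        Matcher m = TITLE.matcher(record);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String record = "\n<page>\n<title>Anarchism</title>\n<ns>0</ns>\n<id>12</id>\n";
        // Without DOTALL, '.' stops at line breaks, so the whole-string match fails.
        System.out.println(record.matches(".*<title>([A-Za-z]+)</title>.*"));     // false
        // (?s) enables DOTALL, letting '.' cross line breaks.
        System.out.println(record.matches("(?s).*<title>([A-Za-z]+)</title>.*")); // true
        System.out.println(extractTitle(record).orElse("<none>"));                // Anarchism
    }
}
```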

Kotlin Regex named groups support

Does Kotlin have support for named regex groups?
Named regex group looks like this: (?<name>...)
According to this discussion,
This will be supported in Kotlin 1.1.
https://youtrack.jetbrains.com/issue/KT-12753
Kotlin 1.1 EAP is already available to try.
"""(\w+?)(?<num>\d+)""".toRegex().matchEntire("area51")!!.groups["num"]!!.value
You'll have to use kotlin-stdlib-jre8.
As of Kotlin 1.0, the Regex class doesn't provide a way to access matched named groups in MatchGroupCollection, because the standard library can only use the regex API available in JDK 6, which does not support named groups either.
If you target JDK8 you can use java.util.regex.Pattern and java.util.regex.Matcher classes. The latter provides group method to get the result of named-capturing group match.
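A minimal sketch of that java.util.regex route, written in Java for illustration (the same classes can be called from Kotlin). The class and method names NamedGroupDemo and numberOf are hypothetical; Matcher.group(String) is the standard method for reading a named group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NamedGroupDemo {
    // (?<num>...) declares a capturing group that can be read back by name.
    private static final Pattern AREA = Pattern.compile("(\\w+?)(?<num>\\d+)");

    // Returns the text captured by the group named "num", or null when the
    // input as a whole does not match the pattern.
    static String numberOf(String input) {
        Matcher m = AREA.matcher(input);
        return m.matches() ? m.group("num") : null;
    }

    public static void main(String[] args) {
        System.out.println(numberOf("area51")); // 51
    }
}
```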
As of Kotlin 1.4, you need to cast the result of groups to MatchNamedGroupCollection:
val groups = """(\w+?)(?<num>\d+)""".toRegex().matchEntire("area51")!!.groups as? MatchNamedGroupCollection
if (groups != null) {
    println(groups.get("num")?.value)
}
And as @Vadzim correctly noticed, you must use kotlin-stdlib-jdk8 instead of kotlin-stdlib:
dependencies {
    implementation "org.jetbrains.kotlin:kotlin-stdlib-jdk8"
}
Here is a good explanation about it
The above answers did not work for me. What did work, however, was the following approach:
import java.util.regex.Pattern
val pattern = Pattern.compile("""(\w+?)(?<num>\d+)""")
val matcher = pattern.matcher("area51")
while (matcher.find()) {
    val result = matcher.group("num")
}
fun regex(regex: Regex, input: String, group: String): String {
    return regex
        .matchEntire(input)!!
        .groups[group]!!
        .value
}
@Test
fun regex() {
    // given
    val expected = "s3://asdf/qwer"
    val pattern = "[\\s\\S]*Location\\s+(?<s3>[\\w/:_-]+)[\\s\\S]*"
    val input = """
        ...
        ...
        Location s3://asdf/qwer
        Serde Library org.apache.hadoop.hive.ql.io.orc.OrcSerde
        InputFormat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
        OutputFormat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
    """.trimIndent()
    val group = "s3"
    // when
    val actual = CommonUtil.regex(pattern.toRegex(), input, group)
    // then
    assertEquals(expected, actual)
}

Task not serializable - Regex

I have a movie with a title. The title contains the year of the movie, like "Movie (Year)". I want to extract the year, and I'm using a regex for this.
case class MovieRaw(movieid:Long,genres:String,title:String)
case class Movie(movieid:Long,genres:Set[String],title:String,year:Int)
val regexYear = ".*\\((\\d*)\\)".r
moviesRaw.map{case MovieRaw(i,g,t) => Movie(i,g,t,t.trim() match { case regexYear(y) => Integer.parseInt(y)})}
When executing the last command, I get the following error:
java.io.NotSerializableException: org.apache.spark.SparkConf
Running in the Spark/Scala REPL, with this SparkContext:
val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
val sc = new SparkContext(conf)
As Dean explained, the cause of the problem is that the REPL creates a class out of the code entered into it and, in this case, the other variables in the same context are "pulled" into the closure by the regex declaration.
Given the way you're creating the context, a simple way to avoid that serialization issue would be to declare the SparkConf and SparkContext transient:
@transient val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
@transient val sc = new SparkContext(conf)
You don't even need to recreate the spark context in the REPL for the only purpose of connecting to Cassandra:
spark-shell --conf spark.cassandra.connection.host=localhost
You probably have this code in a larger Scala class or object (a type), right? If so, in order to serialize the regexYear, the whole enclosing type gets serialized, but you probably have the SparkConf defined in that type.
This is a very common and confusing problem and efforts are underway to prevent it, given the constraints of the JVM and languages on top of it, like Java.
The solution (for now) is to put regexYear inside a method or another object:
object MyJob {
  def main(...) = {
    case class MovieRaw(movieid:Long,genres:String,title:String)
    case class Movie(movieid:Long,genres:Set[String],title:String,year:Int)
    val regexYear = ".*\\((\\d*)\\)".r
    moviesRaw.map{case MovieRaw(i,g,t) => Movie(i,g,t,t.trim() match { case regexYear(y) => Integer.parseInt(y)})}
    ...
  }
}
or
...
object small {
  case class MovieRaw(movieid:Long,genres:String,title:String)
  case class Movie(movieid:Long,genres:Set[String],title:String,year:Int)
  val regexYear = ".*\\((\\d*)\\)".r
  moviesRaw.map{case MovieRaw(i,g,t) => Movie(i,g,t,t.trim() match { case regexYear(y) => Integer.parseInt(y)})}
}
Hope this helps.
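The capture mechanism described above can be reproduced without Spark. The sketch below uses plain Java serialization; FakeConf is a hypothetical stand-in for SparkConf, Job stands in for the enclosing scope, and SerFn is a made-up serializable function type. It is an analogy under those assumptions, not Spark's or the Scala REPL's actual closure handling: a lambda that reads a field through `this` drags the whole enclosing object along, while one that uses a local copy only carries the regex.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ClosureCaptureDemo {
    // A function type that can be serialized, like a closure shipped to executors.
    interface SerFn<A, B> extends Function<A, B>, Serializable {}

    // Hypothetical stand-in for SparkConf: deliberately not Serializable.
    static final class FakeConf {}

    // Stand-in for the enclosing scope that holds both the conf and the regex.
    static final class Job implements Serializable {
        final FakeConf conf = new FakeConf();
        final Pattern yearRegex = Pattern.compile(".*\\((\\d+)\\)");

        // Reads the field through 'this', so the whole Job (including conf) is captured.
        SerFn<String, Integer> capturingEnclosingObject() {
            return title -> parseYear(yearRegex, title);
        }

        // Copies the field to a local first, so only the Pattern is captured.
        SerFn<String, Integer> capturingLocalCopy() {
            Pattern localRegex = yearRegex;
            return title -> parseYear(localRegex, title);
        }
    }

    // Returns the year in a "Movie (Year)" title, or null when there is none.
    static Integer parseYear(Pattern p, String title) {
        Matcher m = p.matcher(title.trim());
        return m.matches() ? Integer.valueOf(m.group(1)) : null;
    }

    // True if Java serialization accepts the object, false on NotSerializableException.
    static boolean serializes(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Job job = new Job();
        System.out.println(serializes(job.capturingEnclosingObject()));          // false
        System.out.println(serializes(job.capturingLocalCopy()));                // true
        System.out.println(job.capturingLocalCopy().apply("Toy Story (1995)"));  // 1995
    }
}
```

Marking the conf field transient in Job would also make the first lambda serializable, which corresponds to the @transient suggestion in the other answer.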
Try passing in the cassandra option on the command line for spark-shell like this:
spark-shell [other options] --conf spark.cassandra.connection.host=localhost
And that way you won't have to recreate the SparkContext -- you can use the SparkContext (sc) that gets instantiated automatically with spark-shell.