Number of lines containing a substring in a DataFrame - regex

I tried this solution to test if a string contains a substring:
val reg = ".*\\[CS_RES\\].*".r
reg.findAllIn(my_DataFrame).length
But it is not working because I can't apply findAllIn to a DataFrame.
I tried this second solution, I converted my DataFrame to RDD:
val rows: RDD[Row] = myDataFrame.rdd
val processedRDD = rows.map { str =>
  val patternReg = ".*\\[CS_RES\\].*".r
  val result = patternReg.findAllIn(str).length
  (str, result)
}
It displays an error:
<console>:69: error: type mismatch;
found : org.apache.spark.sql.Row
required: CharSequence
val result = patternReg.findAllIn(str).length
How can I apply a regex to a DataFrame in Scala, as in the first solution, to count the lines that contain the string [CS_RES]? Or does someone have a fix for the second solution?

You can use the regexp_extract function to filter and count the lines. For example:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
private val session: SparkSession = ...
import session.implicits._
val myDataFrame = Seq(
  (1L, "abc"),
  (2L, "def"),
  (3L, "a[CS_RES]b"),
  (4L, "adg")
).toDF("id", "text")
val resultRegex = myDataFrame.where(regexp_extract($"text", "\\[CS_RES\\]", 0).notEqual("")).count()
println(resultRegex) // outputs 1
The idea is: if group 0 (the entire match) returned by regexp_extract is not an empty string, the substring was found. The call to count() then returns the number of matching rows.
But if you only need to find exact matches of a substring, the solution can be simplified by using the locate function:
val resultLocate = myDataFrame.where(locate("[CS_RES]", $"text") > 0).count()
println(resultLocate) // outputs 1
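As a side note (not part of the original answer): for a plain substring test, Spark's Column.contains can be used as well, which avoids regex escaping entirely. A minimal sketch against the same myDataFrame:
// Column.contains performs a literal substring test, so "[" and "]" need no escaping
val resultContains = myDataFrame.where($"text".contains("[CS_RES]")).count()
println(resultContains) // outputs 1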

import org.apache.spark.sql.functions.udf

val reg = ".*\\[CS_RES\\].*".r
val contains = udf((s: String) => reg.findAllIn(s).length > 0)
val cnt = df.select($"summary").filter(contains($"summary")).count()
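The second (RDD-based) attempt from the question can also be repaired: the type mismatch arises because findAllIn expects a CharSequence, while the RDD elements are Rows. A minimal sketch, assuming the text column is named "text" as in the example above:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

val rows: RDD[Row] = myDataFrame.rdd
val processedRDD = rows.map { row =>
  val patternReg = ".*\\[CS_RES\\].*".r
  // Pull the String out of the Row first (assumption: the column is named "text")
  val text = row.getAs[String]("text")
  (text, patternReg.findAllIn(text).length)
}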

Related

transform string scala in an elegant way

I have the following input string: val s = "19860803 000000"
I want to convert it to 1986/08/03
I tried s.split(" ").head, but this is not complete.
Is there an elegant Scala way, with regex, to get the expected result?
You can use a date-like pattern with 3 capture groups, and also match the following space and the 6 digits.
In the replacement, join the 3 groups with forward slashes.
val s = "19860803 000000"
val result = s.replaceAll("^(\\d{4})(\\d{2})(\\d{2})\\h\\d{6}$", "$1/$2/$3")
Output
result: String = 1986/08/03
I haven't tested this, but I think the following will work:
val expr = raw"(\d{4})(\d{2})(\d{2}) (.*)".r
val formatted = "19860803 000000" match {
  case expr(year, month, day, _) => s"$year/$month/$day"
}
The Scala docs have a lot of good info:
https://www.scala-lang.org/api/2.13.6/scala/util/matching/Regex.html
An alternative without a regular expression, using slice and take:
val s = "19860803 000000"
val year = s.take(4)
val month = s.slice(4,6)
val day = s.slice(6,8)
val result = s"$year/$month/$day"
Or as a one-liner:
val result = Seq(s.take(4), s.slice(4,6), s.slice(6,8)).mkString("/")
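Yet another option, shown here only as a sketch since it was not in the original answers: parse the first 8 digits as a real date with java.time, which validates the input as a side effect.
import java.time.LocalDate
import java.time.format.DateTimeFormatter

val s = "19860803 000000"
// BASIC_ISO_DATE parses the yyyyMMdd form; an invalid date throws a DateTimeParseException here
val date = LocalDate.parse(s.take(8), DateTimeFormatter.BASIC_ISO_DATE)
val result = date.format(DateTimeFormatter.ofPattern("yyyy/MM/dd"))
// result: String = 1986/08/03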

Extract words from a string column in spark dataframe

I have a column in spark dataframe which has text.
I want to extract all the words which start with the special character '#', and I am using regexp_extract on each row of that text column. If the text contains multiple words starting with '#', it just returns the first one.
I am looking for extracting multiple words which match my pattern in Spark.
data_frame.withColumn("Names", regexp_extract($"text","(?<=^|(?<=[^a-zA-Z0-9-_\.]))#([A-Za-z]+[A-Za-z0-9_]+)",1).show
Sample input: #always_nidhi #YouTube no i dnt understand bt i loved the music nd their dance awesome all the song of this mve is rocking
Sample output: #always_nidhi,#YouTube
You can create a UDF in Spark as below:
import java.util.regex.Pattern
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, udf}
def regexp_extractAll = udf((job: String, exp: String, groupIdx: Int) => {
  val pattern = Pattern.compile(exp)
  val m = pattern.matcher(job)
  var result = Seq[String]()
  while (m.find) {
    result = result :+ m.group(groupIdx)
  }
  result.mkString(",")
})
And then call the udf as below:
data_frame.withColumn("Names", regexp_extractAll(new Column("text"), lit("#\\w+"), lit(0))).show()
The above will give you output as below:
+--------------------+
| Names|
+--------------------+
|#always_nidhi,#Yo...|
+--------------------+
I have used the regex as per the output you posted in the question. You can modify it to suit your needs.
You can use Java regex to extract those words. Below is the working code.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions.{col, udf}
import java.util.regex.Pattern

val sparkConf = new SparkConf().setAppName("myapp").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

// User-defined function to extract the matches
def toExtract(str: String) = {
  val pattern = Pattern.compile("#\\w+")
  val tmplst = scala.collection.mutable.ListBuffer.empty[String]
  val matcher = pattern.matcher(str)
  while (matcher.find()) {
    tmplst += matcher.group()
  }
  tmplst.mkString(",")
}
val Extract = udf(toExtract _)
val values = List("#always_nidhi #YouTube no i dnt understand bt i loved the music nd their dance awesome all the song of this mve is rocking")
val df = sc.parallelize(values).toDF("words")
df.select(Extract(col("words"))).show()
Output
+--------------------+
| UDF(words)|
+--------------------+
|#always_nidhi,#Yo...|
+--------------------+
In Spark 3.1+ it's possible using regexp_extract_all
Test with your input:
import spark.implicits._
import org.apache.spark.sql.functions.{array_join, expr}

var df = Seq(
  ("#always_nidhi #YouTube no"),
  ("#always_nidhi"),
  ("no")
).toDF("text")

val col_re_list = expr("regexp_extract_all(text, '(?<=^|(?<=[^a-zA-Z0-9-_\\\\.]))#([A-Za-z]+[A-Za-z0-9_]+)', 0)")
df.withColumn("Names", array_join(col_re_list, ", ")).show(false)
// +-------------------------+-----------------------+
// |text |Names |
// +-------------------------+-----------------------+
// |#always_nidhi #YouTube no|#always_nidhi, #YouTube|
// |#always_nidhi |#always_nidhi |
// |no | |
// +-------------------------+-----------------------+
array_join is used because you wanted the results in string format, while regexp_extract_all returns an array.
If you use \ for escaping in your pattern, you will need to write \\\\ instead of \, as long as regexp_extract_all is only available through expr (i.e. inside a SQL string).
I took the suggestion of Amit Kumar and created a UDF and then ran it in Spark SQL:
select Words(status) as people from dataframe
Words is my UDF and status is my dataframe column.
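For completeness, a minimal sketch of what that registration could look like; the body of Words is an assumption based on the question's # pattern, not the poster's actual code:
// Hypothetical UDF registration; the extraction logic mirrors the answers above
spark.udf.register("Words", (text: String) => "#\\w+".r.findAllIn(text).mkString(","))

dataframe.createOrReplaceTempView("dataframe")
spark.sql("select Words(status) as people from dataframe").show(false)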

Jmeter Regular expression match number

I have two values to correlate and I am able to capture them in two parameters successfully. I am taking random values using -1 as the match number, but I want both values to stay in sync: say my first value randomly takes match number 7, then I want my second value to take match number 7 as well.
Please help me with how I can simulate this.
Unfortunately (as you've discovered), JMeter picks the 'random' match number for each extractor independently. What you'll need to do is capture every potential value (with -1) for both var1 and var2. Then, after your regexes, add a BeanShell PostProcessor that picks a random number n and selects the nth var1 and var2:
Random random = new Random();
int matchNr = Integer.parseInt(vars.get("var1_name_matchNr"));
String random_number = String.valueOf(random.nextInt(matchNr) + 1); // match numbers are 1-based
vars.put("var1_name_chosen", vars.get("var1_name_" + random_number));
vars.put("var2_name_chosen", vars.get("var2_name_" + random_number));
If I understood correctly, you want to extract a random regex value and put it into 2 variables. If so, I would suggest the following: after you get the random regex value, add a BeanShell step that copies the value you got with the regex into the second variable.
So if your regex variable is "foo1", just add a BeanShell sampler with:
vars.put("foo2", vars.get("foo1"));
EDIT:
This would be better as a Java sampler, but I think it should work in a BeanShell sampler as well.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.jmeter.samplers.SampleResult;
import org.apache.jmeter.threads.JMeterContextService;
import java.util.ArrayList;
import java.util.Random;

String previousResponse = JMeterContextService.getContext()
        .getPreviousResult().getResponseDataAsString();
String locationLinkRegex = "\"locationId\": (.+?),";
String myLocationId = RegexMethod(previousResponse, locationLinkRegex, true);
String myLocationLink = RegexMethod(previousResponse,
        "\"locationId\": ".concat(myLocationId).concat(", \"locationLink\": \"(.+?)\""),
        false);
JMeterContextService.getContext().getVariables().put("locationId", myLocationId);
JMeterContextService.getContext().getVariables().put("locationLink", myLocationLink);

private static String RegexMethod(String response, String regex, Boolean random) {
    Random ran = new Random();
    String result = "No matcher!";
    ArrayList<String> allMatches = new ArrayList<String>();
    if (random) {
        Matcher m = Pattern.compile(regex, Pattern.UNICODE_CASE).matcher(response);
        while (m.find()) {
            allMatches.add(m.group());
        }
        result = allMatches.get(ran.nextInt(allMatches.size()));
    } else {
        Matcher m = Pattern.compile(regex, Pattern.UNICODE_CASE).matcher(response);
        m.find();
        result = m.group(1);
    }
    return result;
}
Exception handling needs to be implemented as well...
EDIT2:
And the regex method as recursive (returns both values as CSV; can be used only if locationId is unique):
private static String RegexMethod(String response, String regex) {
    Random ran = new Random();
    String result = "No matcher!";
    List<String> allMatches = new ArrayList<String>();
    // Find locationId:
    Matcher m1 = Pattern.compile(regex, Pattern.UNICODE_CASE).matcher(response);
    while (m1.find()) {
        allMatches.add(m1.group());
    }
    result = allMatches.get(ran.nextInt(allMatches.size())).concat(",");
    // Find locationLink and return the CSV string:
    return result += RegexMethod(response, "\"locationId\": "
            .concat(result.substring(0, result.length() - 1)).concat(", \"locationLink\": \"(.+?)\""));
}

apache-spark regex extract words from rdd

I am trying to extract words from a text file.
Text file:
"Line1 with words to extract"
"Line2 with words to extract"
"Line3 with words to extract"
The following works well:
val data = sc.textFile(file_in).map(_.toLowerCase).cache()
val all = data.flatMap(a => "[a-zA-Z]+".r findAllIn a)
scala> data.count
res14: Long = 3
scala> all.count
res11: Long = 1419
But I want to extract the words for every line.
If I type
val separated = data.map(line => line.flatMap(a => "[a-zA-Z]+".r findAllIn a))
I get
scala> val separated = data.map(line => line.flatMap(a => "[a-zA-Z]+".r findAllIn a))
<console>:17: error: type mismatch;
found : Char
required: CharSequence
val separated = data.map(line => line.flatMap(a => "[a-zA-Z]+".r findAllIn a))
What am I doing wrong?
Thanks in advance
Thank you for your answer.
The goal was to count the occurrence of words from a pos/neg wordlist.
It seems this works:
// load inputfile
val file_in = "/path/to/teststring.txt"
val data = sc.textFile(file_in).map(_.toLowerCase).cache()
// load wordlists
val pos_file = "/path/to/pos_list.txt"
val neg_file = "/path/to/neg_list.txt"
val pos_words = sc.textFile(pos_file).cache().collect().toSet
val neg_words = sc.textFile(neg_file).cache().collect().toSet
// RegEx
val regexpr = """[a-zA-Z]+""".r
val separated = data.map(line => regexpr.findAllIn(line).toList)
// #_of_words - #_of_pos_words - #_of_neg_words
val counts = separated.map(list =>
  (list.size,
   list.filter(pos => pos_words contains pos).size,
   list.filter(neg => neg_words contains neg).size))
Your problem is not exactly Apache Spark: your first map hands you a whole line, but the flatMap on that line makes you iterate over the characters of the line String. So, Spark or not, your code won't work; for example, in a Scala REPL:
> val lines = List("Line1 with words to extract",
"Line2 with words to extract",
"Line3 with words to extract")
> lines.map(line => line.flatMap("[a-zA-Z]+".r findAllIn _))
<console>:9: error: type mismatch;
found : Char
required: CharSequence
So if you really want all the words in your lines, using your regexp, just use flatMap once:
scala> lines.flatMap("[a-zA-Z]+".r findAllIn _)
res: List[String] = List(Line, with, words, to, extract, Line, with, words, to, extract, Line, with, words, to, extract)
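If instead you want the words grouped per line, as the original question asked, keep the outer map and apply findAllIn to each whole line (a REPL sketch along the same lines):
scala> lines.map(line => "[a-zA-Z]+".r.findAllIn(line).toList)
res: List[List[String]] = List(List(Line, with, words, to, extract), List(Line, with, words, to, extract), List(Line, with, words, to, extract))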
Regards,

Detect and transform numbers in a string using regular expressions

How can I use a regular expression and matching to replace contents of a string? In particular I want to detect integer numbers and increment them. Like so:
val y = "There is number 2 here"
val p = "\\d+".r
def inc(x: String, c: Int): String = ???
assert(inc(y, 1) == "There is number 3 here")
Using replaceAllIn with a replacement function is one convenient way to write this:
val y = "There is number 2 here"
val p = "-?\\d+".r
import scala.util.matching.Regex.Match
def replacer(c: Int): Match => String = {
case Match(i) => (i.toInt + c).toString
}
def inc(x: String, c: Int): String = p.replaceAllIn(x, replacer(c))
And then:
scala> inc(y, 1)
res0: String = There is number 3 here
Scala's Regex provides a handful of useful tools like this, including replaceSomeIn, which takes a Match => Option[String] so that only some matches are replaced, etc.
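For instance, a minimal sketch (not from the original answer) using replaceSomeIn with the same p, incrementing only numbers below an arbitrary threshold of 10 and leaving the rest untouched:
def incSmall(x: String, c: Int): String =
  p.replaceSomeIn(x, m => {
    val n = m.matched.toInt
    // Returning None keeps the original match unchanged
    if (n < 10) Some((n + c).toString) else None
  })

// incSmall("numbers 2 and 42", 1) == "numbers 3 and 42"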