Regex replace string in PySpark

I have a column with string values like '{"phones":["phone1", "phone2"]}' and I would like to remove the special characters to end up with a string like phone1, phone2. I am using a regex like
df.withColumn('Phones',
              F.regexp_replace(F.split(F.col('input_phones'), ':').getItem(1), r'\}', ''))
which returns a string like '["phone1", "phone2"]'.
Is there a way to test different regex and how to exclude other special characters?

Your string ('{"phones":["phone1", "phone2"]}') looks like JSON, so we can parse it in PySpark using from_json.
from pyspark.sql import functions as func
from pyspark.sql.types import StructType, StructField, ArrayType, StringType

data_sdf = spark.sparkContext.parallelize([('{"phones":["phone1", "phone2"]}',)]).toDF(['json_str'])

# the JSON's schema
json_target_schema = StructType([
    StructField('phones', ArrayType(StringType()), True)
])

data_sdf. \
    withColumn('json_parsed', func.from_json(func.col('json_str'), json_target_schema)). \
    select('json_str', 'json_parsed.*'). \
    withColumn('phones_str', func.concat_ws(',', 'phones')). \
    show(truncate=False)
# +-------------------------------+----------------+-------------+
# |json_str |phones |phones_str |
# +-------------------------------+----------------+-------------+
# |{"phones":["phone1", "phone2"]}|[phone1, phone2]|phone1,phone2|
# +-------------------------------+----------------+-------------+
Let's check the DataFrame's schema to see the columns' data types:
data_sdf. \
    withColumn('json_parsed', func.from_json(func.col('json_str'), json_target_schema)). \
    select('json_str', 'json_parsed', 'json_parsed.*'). \
    withColumn('phones_str', func.concat_ws(',', 'phones')). \
    printSchema()
# root
# |-- json_str: string (nullable = true)
# |-- json_parsed: struct (nullable = true)
# | |-- phones: array (nullable = true)
# | | |-- element: string (containsNull = true)
# |-- phones: array (nullable = true)
# | |-- element: string (containsNull = true)
# |-- phones_str: string (nullable = false)
# +-------------------------------+------------------+----------------+-------------+
# |json_str |json_parsed |phones |phones_str |
# +-------------------------------+------------------+----------------+-------------+
# |{"phones":["phone1", "phone2"]}|{[phone1, phone2]}|[phone1, phone2]|phone1,phone2|
# +-------------------------------+------------------+----------------+-------------+
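If you do want to stay with the regex route from the question, a single character class handles all the special characters at once. A plain-Python sketch of the same idea (the PySpark version would pass the same pattern to F.regexp_replace):

```python
import re

s = '{"phones":["phone1", "phone2"]}'
# take everything after the first ':' (like split(...).getItem(1)),
# then strip all of { } [ ] " with one character class
tail = s.split(':', 1)[1]
result = re.sub(r'[{}\[\]"]', '', tail)
print(result)  # phone1, phone2
```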

positive lookbehind in kotlin doesn't work in match

I'm iterating on this file:
[INFO] com.demo:communication:jar:3.5.0-SNAPSHOT
[INFO] +- com.cellwize.optserver:optserver-admin:jar:3.5.0-SNAPSHOT:compile
[INFO] | +- org.apache.logging.log4j:log4j-api:jar:2.7:compile
[INFO] | +- org.apache.logging.log4j:log4j-core:jar:2.7:compile
[INFO] | | \- (org.apache.logging.log4j:log4j-api:jar:2.7:compile - omitted for duplicate)
[INFO] | +- org.apache.logging.log4j:log4j-slf4j-impl:jar:2.7:compile
[INFO] | | +- org.slf4j:slf4j-api:jar:1.7.21:compile
[INFO] | | \- (org.apache.logging.log4j:log4j-api:jar:2.7:compile - omitted for duplicate)
I want to remove all the prefix on every line: "[INFO] " / "[INFO] +- " / "[INFO] | | - " etc
I'm using this function I wrote on every line in the file:
private fun extractDependency(raw: String): Dependency {
    val uniqueDependencyRegex = Regex.fromLiteral("(?<=\\+- ).*")
    val duplicateDependencyRegex = Regex.fromLiteral("(?<=\\().+?(?=\\))")
    val projectRegex = Regex.fromLiteral("(?<=\\[INFO\\] ).*")
    when {
        uniqueDependencyRegex matches raw -> {
            val matchResult = uniqueDependencyRegex.matchEntire(raw)
            println(matchResult)
        }
        duplicateDependencyRegex matches raw -> {
            val matchResult = duplicateDependencyRegex.matchEntire(raw)
            println(matchResult)
        }
        projectRegex matches raw -> {
            val matchResult = projectRegex.matchEntire(raw)
            println(matchResult)
        }
        else -> {
            // TODO - throw exception
        }
    }
    return Dependency("test", "test", "test", "test")
}
I'm expecting it to work, since I tested each of the three regular expressions separately and they matched.
The result I want is:
com.demo:communication:jar:3.5.0-SNAPSHOT
com.cellwize.optserver:optserver-admin:jar:3.5.0-SNAPSHOT:compile
org.apache.logging.log4j:log4j-api:jar:2.7:compile
org.apache.logging.log4j:log4j-core:jar:2.7:compile
org.apache.logging.log4j:log4j-api:jar:2.7:compile - omitted for duplicate
org.apache.logging.log4j:log4j-slf4j-impl:jar:2.7:compile
org.slf4j:slf4j-api:jar:1.7.21:compile
org.apache.logging.log4j:log4j-api:jar:2.7:compile - omitted for duplicate
You could either match [INFO] followed by a character class that will match any of the listed characters [| +\\(-], or match ) at the end of the string.
In the replacement use an empty string.
^\[INFO\][| +\\(-]+|\)$
With double-escaped backslashes (as needed inside a string literal):
^\\[INFO\\][| +\\\\(-]+|\\)$
A slightly more precise pattern could repeatedly match any of the occurring prefixes like |, +- or \-, and capture the content between optional parentheses in group 1. Then use the group in the replacement.
^\[INFO\](?:(?: +(?:\||\+-|\\-))+)? +\(?(.*?)\)?$
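A runnable Kotlin sketch of the simpler pattern. One thing worth noting: the question's code builds its patterns with Regex.fromLiteral, which escapes every metacharacter, so the lookbehinds there were never interpreted as regex syntax at all; the Regex(...) constructor is what's needed:

```kotlin
fun main() {
    // character class of prefix characters after [INFO], or a trailing ')'
    val prefix = Regex("""^\[INFO\][| +\\(-]+|\)$""")
    val lines = listOf(
        "[INFO] com.demo:communication:jar:3.5.0-SNAPSHOT",
        """[INFO] | | \- (org.apache.logging.log4j:log4j-api:jar:2.7:compile - omitted for duplicate)"""
    )
    // replace each match with an empty string
    lines.forEach { println(it.replace(prefix, "")) }
    // com.demo:communication:jar:3.5.0-SNAPSHOT
    // org.apache.logging.log4j:log4j-api:jar:2.7:compile - omitted for duplicate
}
```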

Matching string between two markers that are filepaths and contain special characters

I'm trying to write a Ruby script that will return the text between two other strings. The issue is that the two marker strings contain special characters, and escaping the special characters is not solving the problem.
I've tried escaping the special characters, different matching patterns, and putting the marker strings in variables, without much luck.
I've also tested a simplified match by using only ODS and NAME as delimiters. That seemed to work.
# Example contents of logfile:
# 'aaaaaaaaa ODS | Filename = /tmp/bbbbbb | NAME = ccccc'
log_to_scan = 'logfile'
marker1 = 'ODS | FILENAME = /tmp/'
marker2 = ' | NAME'
contents = File.read(log_to_scan)
print contents.match(/ODS \| FILENAME = \/tmp\/(.*) \| NAME/m)[1].strip
print contents.match(/marker1(.*)marker2/m)[1].strip
Given the sample contents above, I am expecting the output to be bbbbbb. However, I am getting either nothing or a NoMethodError. Not sure what else to try or what mistake I'm making.
str = 'aaaaaaaaa ODS | Filename = /tmp/bbbbbb | NAME = ccccc'
marker1 = 'ODS | FILENAME = /tmp/'
marker2 = ' | NAME'
r = /(?<=#{Regexp.escape(marker1)}).*(?=#{Regexp.escape(marker2)})/i
#=> /(?<=ODS\ \|\ FILENAME\ =\ \/tmp\/).*(?=\ \|\ NAME)/i
str[r]
#=> "bbbbbb"
or
r = /#{Regexp.escape(marker1)}(.*)#{Regexp.escape(marker2)}/i
str[r,1]
#=> "bbbbbb"
or, if the string to be matched is known to be lower-case, or it is permissible to return that string downcased:
s = str.downcase
#=> "aaaaaaaaa ods | filename = /tmp/bbbbbb | name = ccccc"
m1 = marker1.downcase
#=> "ods | filename = /tmp/"
m2 = marker2.downcase
#=> " | name"
id1 = s.index(m1) + m1.size
#=> 32
id2 = s.index(m2, id1+1) - 1
#=> 37
str[id1..id2]
#=> "bbbbbb"
See Regexp::escape. In #1,
(?<=#{Regexp.escape(marker1)})
is a positive lookbehind, requiring marker1 to appear immediately before the match.
(?=#{Regexp.escape(marker2)})
is a positive lookahead, requiring marker2 to immediately follow the match.
In #3, I used the form of String#index that takes a second argument ("offset").
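One detail worth calling out: the /i flag in the patterns above is doing real work, because the sample log line says Filename while the marker says FILENAME. That case mismatch is also the likely source of the NoMethodError in the question: match returns nil, and calling [1] on nil raises. A minimal check:

```ruby
str = 'aaaaaaaaa ODS | Filename = /tmp/bbbbbb | NAME = ccccc'

# the log says 'Filename', the pattern says 'FILENAME' --
# without /i there is no match, so match returns nil
p str.match(/ODS \| FILENAME = \/tmp\/(.*) \| NAME/)     #=> nil

# the /i flag makes the comparison case-insensitive
p str.match(/ODS \| FILENAME = \/tmp\/(.*) \| NAME/i)[1] #=> "bbbbbb"
```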
Your original expression is mostly fine; slightly modified here so that it tolerates additional spaces anywhere in the input:
^.+?ODS(\s+)?\|(\s+)?FILENAME(\s+)?=(\s+)?\/tmp\/(.+?)(\s+)?\|(\s+)?NAME(\s+)?=(\s+)?(.+?)$
and the desired outputs are in the two (.+?) capturing groups.
Test
re = /^.+?ODS(\s+)?\|(\s+)?FILENAME(\s+)?=(\s+)?\/tmp\/(.+?)(\s+)?\|(\s+)?NAME(\s+)?=(\s+)?(.+?)$/mi
str = 'aaaaaaaaa ODS | Filename = /tmp/bbbbbb | NAME = ccccc'
# Print the match result
str.scan(re) do |match|
puts match.to_s
end
How about String#scanf?
> require 'scanf'
> str = 'ODS | FILENAME = /tmp/ | NAME'
> str.scanf('ODS | FILENAME = %s | NAME')
=> ["/tmp/"]

regex of json string in data frame using spark scala

I am having trouble retrieving a value from a JSON string using regex in spark.
My pattern is:
val st1 = """id":"(.*?)"""
val pattern = s"${'"'}$st1${'"'}"
//pattern is: "id":"(.*?)"
My test string in a DF is
import spark.implicits._
val jsonStr = """{"type":"x","identifier":"y","id":"1d5482864c60d5bd07919490"}"""
val df = sqlContext.sparkContext.parallelize(Seq(jsonStr)).toDF("request")
I am then trying to parse out the id value and add it to the df through a UDF like so:
def getSubStringGroup(pattern: String) = udf((request: String) => {
  val patternWithResponseRegex = pattern.r
  val subString = request match {
    case patternWithResponseRegex(idextracted) => Array(idextracted)
    case _ => Array("na")
  }
  subString
})
val dfWithIdExtracted = df.select($"request")
.withColumn("patternMatchGroups", getSubStringGroup(pattern)($"request"))
.withColumn("idextracted", $"patternMatchGroups".getItem(0))
.drop("patternMatchGroups")
So I want my df to look like
|---------------------------------------------------------------|--------------------------|
| request                                                       | id                       |
|---------------------------------------------------------------|--------------------------|
| {"type":"x","identifier":"y","id":"1d5482864c60d5bd07919490"} | 1d5482864c60d5bd07919490 |
|---------------------------------------------------------------|--------------------------|
However, when I try the above method, my match comes back as "null" despite working on regex101.com
Could anyone advise or suggest a different method? Thank you.
Following Krzysztof's solution, my table now looks like so:
|---------------------------------------------------------------|---------------------------------|
| request                                                       | id                              |
|---------------------------------------------------------------|---------------------------------|
| {"type":"x","identifier":"y","id":"1d5482864c60d5bd07919490"} | "id":"1d5482864c60d5bd07919490" |
|---------------------------------------------------------------|---------------------------------|
I created a new udf to trim the unnecessary characters and added it to the df:
def trimId = udf((idextracted: String) => {
  // drop the leading `"id":"` (6 characters) and the trailing quote
  val id = idextracted.drop(6).dropRight(1)
  id
})
val dfWithIdExtracted = df.select($"request")
.withColumn("patternMatchGroups", getSubStringGroup(pattern)($"request"))
.withColumn("idextracted", $"patternMatchGroups".getItem(0))
.withColumn("id", trimId($"idextracted"))
.drop("patternMatchGroups", "idextracted")
The df now looks as desired. Thanks again Krzysztof!
When you're using pattern matching with a regex, you're trying to match the whole string, which can't succeed here. You should instead search for the first occurrence with findFirstIn:
def getSubStringGroup(pattern: String) = udf((request: String) => {
  val patternWithResponseRegex = pattern.r
  patternWithResponseRegex.findFirstIn(request).map(Array(_)).getOrElse(Array("na"))
})
You're also creating your pattern in a very roundabout way, unless you have a special use case for it. You could just do:
val pattern = """"id":"(.*?)""""
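As an aside, the follow-up trimming UDF can be avoided entirely: findFirstMatchIn exposes the capture groups, so group 1 can be returned directly (an untested sketch, using the same pattern as above):

```scala
def getSubStringGroup(pattern: String) = udf((request: String) => {
  val patternWithResponseRegex = pattern.r
  // group(1) is the (.*?) capture, i.e. the bare id value without `"id":"`
  patternWithResponseRegex.findFirstMatchIn(request)
    .map(m => Array(m.group(1)))
    .getOrElse(Array("na"))
})
```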

Extract words from a string column in spark dataframe

I have a column in spark dataframe which has text.
I want to extract all the words which start with a special character '#' and I am using regexp_extract from each row in that text column. If the text contains multiple words starting with '#' it just returns the first one.
I am looking for extracting multiple words which match my pattern in Spark.
data_frame.withColumn("Names", regexp_extract($"text", "(?<=^|(?<=[^a-zA-Z0-9-_\.]))#([A-Za-z]+[A-Za-z0-9_]+)", 1)).show
Sample input: #always_nidhi #YouTube no i dnt understand bt i loved the music nd their dance awesome all the song of this mve is rocking
Sample output: #always_nidhi,#YouTube
You can create a udf function in spark as below:
import java.util.regex.Pattern
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.lit
def regexp_extractAll = udf((job: String, exp: String, groupIdx: Int) => {
  val pattern = Pattern.compile(exp)
  val m = pattern.matcher(job)
  var result = Seq[String]()
  while (m.find) {
    result = result :+ m.group(groupIdx)
  }
  result.mkString(",")
})
And then call the udf as below:
data_frame.withColumn("Names", regexp_extractAll(new Column("text"), lit("#\\w+"), lit(0))).show()
The above will give you the output below:
+--------------------+
| Names|
+--------------------+
|#always_nidhi,#Yo...|
+--------------------+
I have used the regex as per the output you posted in the question. You can modify it to suit your needs.
You can use java RegEx to extract those words. Below is the working code.
val sparkConf = new SparkConf().setAppName("myapp").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.functions.{col, udf}
import java.util.regex.Pattern
// User Defined Function to extract the #-words
def toExtract(str: String) = {
  val pattern = Pattern.compile("#\\w+")
  val tmplst = scala.collection.mutable.ListBuffer.empty[String]
  val matcher = pattern.matcher(str)
  while (matcher.find()) {
    tmplst += matcher.group()
  }
  tmplst.mkString(",")
}
val Extract = udf(toExtract _)
val values = List("#always_nidhi #YouTube no i dnt understand bt i loved the music nd their dance awesome all the song of this mve is rocking")
val df = sc.parallelize(values).toDF("words")
df.select(Extract(col("words"))).show()
Output
+--------------------+
| UDF(words)|
+--------------------+
|#always_nidhi,#Yo...|
+--------------------+
In Spark 3.1+ it's possible using regexp_extract_all
Test with your input:
import spark.implicits._
val df = Seq(
  ("#always_nidhi #YouTube no"),
  ("#always_nidhi"),
  ("no")
).toDF("text")
val col_re_list = expr("regexp_extract_all(text, '(?<=^|(?<=[^a-zA-Z0-9-_\\\\.]))#([A-Za-z]+[A-Za-z0-9_]+)', 0)")
df.withColumn("Names", array_join(col_re_list, ", ")).show(false)
// +-------------------------+-----------------------+
// |text |Names |
// +-------------------------+-----------------------+
// |#always_nidhi #YouTube no|#always_nidhi, #YouTube|
// |#always_nidhi |#always_nidhi |
// |no | |
// +-------------------------+-----------------------+
array_join is used because you wanted the results in string format, while regexp_extract_all returns an array.
If you use \ for escaping in your pattern, you will need to use \\\\ instead of \, until regexp_extract_all is available directly without expr.
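The hashtag extraction itself can be sanity-checked outside Spark with an ordinary findall; a plain-Python sketch, just to validate the simple #-word pattern used in the UDF answers:

```python
import re

text = ("#always_nidhi #YouTube no i dnt understand bt i loved the music "
        "nd their dance awesome all the song of this mve is rocking")
# same idea as regexp_extract_all with group 0: every word starting with '#'
names = re.findall(r"#\w+", text)
print(", ".join(names))  # #always_nidhi, #YouTube
```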
I took the suggestion of Amit Kumar and created a UDF and then ran it in Spark SQL:
select Words(status) as people from dataframe
Words is my UDF and status is my dataframe column.

Scala regular expression (xml parsing)

I'm parsing an xml file, that has nodes with text like this:
<img src="someUrl1"> American Dollar 1USD | 2,8567 | sometext
<img src="someUrl2"> Euro 1EUR | 3,9446 | sometext
<img src="someUrl3"> Japanese Jen 100JPY | 3,4885 | sometext
What I want to get is values like this:
American Dollar, USD, 2,8576
Euro, EUR, 3,9446
Japanese Jen, JPY, 3,4885
I wonder how could I write the regular expression for this. Scala has some weird regular expressions and I can't figure it out.
If I am understanding you correctly, you just want to use a regex to extract your information. In this case, you can use the extractor functionality of Scala and do something like this:
scala> val RegexParser = """(.*) \d+([A-Z]+) \| (.*) \|.*""".r
RegexParser: scala.util.matching.Regex = (.*) \d+([A-Z]+) \| (.*) \|.*
scala> val RegexParser(name,shortname,value) = "American Dollar 1USD | 2,8567 | sometext"
name: String = American Dollar
shortname: String = USD
value: String = 2,8567
scala> val RegexParser(name,shortname,value) = "Euro 1EUR | 3,9446 | sometext"
name: String = Euro
shortname: String = EUR
value: String = 3,9446
scala> val RegexParser(name,shortname,value) = "Japanese Jen 100JPY | 3,4885 | sometext"
name: String = Japanese Jen
shortname: String = JPY
value: String = 3,4885
First, you create an extractor based on a regex string. This can be done by calling .r on a String (class StringOps, to be exact). After that you can use this extractor to read out all matched elements (name, shortname, value).
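Putting the extractor to work over all three sample lines (a sketch; the <img> prefix is assumed to have been stripped during XML parsing, as in the REPL session above):

```scala
val RegexParser = """(.*) \d+([A-Z]+) \| (.*) \|.*""".r

val lines = Seq(
  "American Dollar 1USD | 2,8567 | sometext",
  "Euro 1EUR | 3,9446 | sometext",
  "Japanese Jen 100JPY | 3,4885 | sometext"
)

// non-matching lines fall through to the default case instead of throwing
lines.foreach {
  case RegexParser(name, shortname, value) => println(s"$name, $shortname, $value")
  case other => println(s"no match: $other")
}
```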