Loading csv files with wildcards to match a pattern - regex

I have 24 CSV files numbered 00 to 23 in their names, e.g. hyper00.csv, hyper01.csv, ..., hyper23.csv. But I just want to load the files from 08 to 15 using wildcards.
Currently I am using folder_name/*{08-15}, but it's not working. I am using Spark.

If you are using Scala, you could try this (note it uses java.io.File, so it only lists files on a local filesystem):
import java.io.File

def getPaths(dir: String): List[String] = {
  val file = new File(dir)
  file.listFiles.filter(_.isFile)
    .filter(_.getName.startsWith("hyper"))
    .filter { s =>
      // strip the "hyper" prefix and the ".csv" extension, keep the number;
      // toInt handles the leading zero, so "08" parses as 8
      val index = s.getName
        .stripPrefix("hyper")
        .replaceAll("\\.(.*)", "")
        .toInt
      index >= 8 && index <= 15
    }
    .map(_.getPath).toList
}
val filesDirectory = "C:\\Users\\user\\hyper"
getPaths(filesDirectory).foreach(println(_))
val df = spark.read.csv(getPaths(filesDirectory):_*)
output:
C:\Users\user\hyper\hyper08.csv
...
C:\Users\user\hyper\hyper15.csv

A small Scala function can be used to build the required file paths:
def getDirs(root: String, start: Int, end: Int): Seq[String] = {
  (start to end).map(v => f"$root/hyper$v%02d.csv")
}
// usage
spark.read.csv(getDirs("folder_name", 8, 15): _*)
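As a third option: Spark resolves paths through Hadoop's glob syntax, which supports brace alternation ({a,b,c}) but not numeric ranges, which is why folder_name/*{08-15} fails. A minimal sketch, assuming the same hyperNN.csv naming, that builds the glob instead of listing the directory:
// Brace alternation enumerates exact alternatives; ranges like {08-15} are
// not supported, so generate the zero-padded suffixes explicitly.
val glob = (8 to 15).map(i => f"$i%02d").mkString("{", ",", "}")
// => {08,09,10,11,12,13,14,15}
val df = spark.read.csv(s"folder_name/hyper${glob}.csv")
Unlike the java.io.File approach above, the glob is expanded by Hadoop, so it should also work for paths on HDFS or S3.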

Related

How to save string as json in scala spark

I have raw strings in a log file. I do many filters and other operations after that, and I have reached the following problem: I need to convert a string into JSON format so that I can save it as a single object.
Suppose I have the following data:
val CDataTime = "20191012"
val LocationId = "12345"
val SetInstruc = "Comm=Qwe123,Elem=12345,Elem123=Test"
I am trying to create a data frame that contains datetime|location|jsonofinstruction, where the JSON column is built from the third val. I tried to split the string first by comma, then by the equals sign, loop through in steps of two, and build a map with the first part as key and the second as value. But the JSON is not created. Please help here.
You can use scala.util.parsing.json.JSONObject to convert a map to JSON and then to a string.
import scala.util.parsing.json.JSONObject
import spark.implicits._

val df = spark.createDataset(Seq("Comm=Qwe123,Elem=12345,Elem123=Test")).toDF("col3")
val dfWithJson = df.map { row =>
  // split "k1=v1,k2=v2,..." into a Map(k1 -> v1, k2 -> v2, ...)
  val insMap = row.getAs[String]("col3").split(",").map { kv =>
    val kvArray = kv.split("=")
    (kvArray(0), kvArray(1))
  }.toMap
  val insJson = JSONObject(insMap).toString()
  (row.getAs[String]("col3"), insJson)
}.toDF("col3", "col4")
dfWithJson.show()
Result -
+--------------------+--------------------+
| col3| col4|
+--------------------+--------------------+
|Comm=Qwe123,Elem=...|{"Comm" : "Qwe123...|
+--------------------+--------------------+
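To get the datetime|location|json layout the question asks for, the same map-to-JSON step can be combined with the other two values. A minimal sketch, assuming the three literals from the question and a Spark session with implicits imported (the column names are illustrative):
import scala.util.parsing.json.JSONObject
import spark.implicits._

val cDataTime = "20191012"
val locationId = "12345"
val setInstruc = "Comm=Qwe123,Elem=12345,Elem123=Test"

// build the JSON string once, then assemble a single-row DataFrame
val insJson = JSONObject(setInstruc.split(",").map { kv =>
  val Array(k, v) = kv.split("=", 2)
  k -> v
}.toMap).toString()

val df = Seq((cDataTime, locationId, insJson))
  .toDF("datetime", "location", "instruction_json")
df.show(false)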

Find starting and ending index of each unique character in a string in Python

I have a string with repeated characters. My job is to find the starting index and ending index of each unique character in that string. Below is my code.
import re
x = "aaabbbbcc"
xs = set(x)
for item in xs:
    mo = re.search(item, x)
    flag = item
    m = mo.start()
    n = mo.end()
    print(flag, m, n)
Output :
a 0 1
b 3 4
c 7 8
Here the end indices of the characters are not correct. I understand why it's happening, but how can I pass the character to be matched dynamically to the regex search function? For instance, if I hardcode the character in the search function, it provides the desired output:
x = 'aabbbbccc'
xs = set(x)
mo = re.search("[b]+", x)
flag = "b"
m = mo.start()
n = mo.end()
print(flag, m, n)
output:
b 2 5
The snippet above provides the correct result, but here I can't pass the character to be matched dynamically.
It would be a real help if someone could let me know how to achieve this; any hint will also do. Thanks in advance.
String literal formatting to the rescue:
import re
x = "aaabbbbcc"
xs = set(x)
for item in xs:
    # for patterns better use raw strings - and format the letter into it
    mo = re.search(fr"{item}+", x)  # fr and rf both work :) it's a raw formatted literal
    flag = item
    m = mo.start()
    n = mo.end()
    print(flag, m, n)  # fix upper limit by n-1
Output:
a 0 3 # you do see that the upper limit is off by 1?
b 3 7 # see above for fix
c 7 9
Your pattern does not need the [] around the letter - you are matching just one anyhow. (If the string could contain regex metacharacters, wrap the letter in re.escape(item) to be safe.)
Without regex [1]:
x = "aaabbbbcc"
last_ch = x[0]
start_idx = 0
# process the remainder
for idx,ch in enumerate(x[1:],1):
if last_ch == ch:
continue
else:
print(last_ch,start_idx, idx-1)
last_ch = ch
start_idx = idx
print(ch,start_idx,idx)
output:
a 0 2 # not off by 1
b 3 6
c 7 8
[1] RegEx: And now you have 2 problems...
Looking at the output, I'm guessing that another option would be,
import re
x = "aaabbbbcc"
xs = re.findall(r"((.)\2*)", x)
start = 0
output = ''
for item in xs:
    end = start + len(item[0])
    output += f"{item[1]} {start} {end}\n"
    start = end
print(output)
Output
a 0 3
b 3 7
c 7 9
I think it'll be on the order of N; you can benchmark it, if you like.
import re, time
timer_on = time.time()
for i in range(10000000):
    x = "aabbbbccc"
    xs = re.findall(r"((.)\2*)", x)
    start = 0
    output = ''
    for item in xs:
        end = start + len(item[0])
        output += f"{item[1]} {start} {end}\n"
        start = end
timer_off = time.time()
timer_total = timer_off - timer_on
print(timer_total)

match a timestamp based on regex pattern matching scala

I wrote the following code :
val reg = "([\\d]{4})-([\\d]{2})-([\\d]{2})(T)([\\d]{2}):([\\d]{2})".r

val dataExtraction: String => Map[String, String] = {
  string: String =>
    string match {
      case reg(year, month, day, symbol, hour, minutes) =>
        Map(YEAR -> year, MONTH -> month, DAY -> day, HOUR -> hour)
      case _ => Map(YEAR -> "", MONTH -> "", DAY -> "", HOUR -> "")
    }
}

val YEAR = "YEAR"
val MONTH = "MONTH"
val DAY = "DAY"
val HOUR = "HOUR"
This function is supposed to be applied to strings having the following format: 2018-08-22T19:10:53.094Z. When I call the function:
dataExtraction("2018-08-22T19:10:53.094Z")
it falls into the default case and returns the map of empty strings.
Your pattern, for all its deficiencies, does work. You just have to unanchor it.
val reg = "([\\d]{4})-([\\d]{2})-([\\d]{2})(T)([\\d]{2}):([\\d]{2})".r.unanchored
. . .
dataExtraction("2018-08-22T19:10:53.094Z")
//res0: Map[String,String] = Map(YEAR -> 2018, MONTH -> 08, DAY -> 22, HOUR -> 19)
But the comment from @CAustin is correct: you could just let the Java LocalDateTime API handle all the heavy lifting.
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter._
val dt = LocalDateTime.parse("2018-08-22T19:10:53.094Z", ISO_DATE_TIME)
Now you have access to all the data without actually saving it to a Map.
dt.getYear //res0: Int = 2018
dt.getMonthValue //res1: Int = 8
dt.getDayOfMonth //res2: Int = 22
dt.getHour //res3: Int = 19
dt.getMinute //res4: Int = 10
dt.getSecond //res5: Int = 53
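If you still need the Map[String, String] shape that the original function returned, here is a small sketch bridging the two (the helper name is hypothetical):
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter.ISO_DATE_TIME

// Hypothetical helper: parse with java.time as above, then rebuild the map,
// zero-padding so the values match what the regex captures produced.
def dataExtractionViaJavaTime(s: String): Map[String, String] = {
  val dt = LocalDateTime.parse(s, ISO_DATE_TIME)
  Map(
    "YEAR"  -> dt.getYear.toString,
    "MONTH" -> f"${dt.getMonthValue}%02d",
    "DAY"   -> f"${dt.getDayOfMonth}%02d",
    "HOUR"  -> f"${dt.getHour}%02d")
}

dataExtractionViaJavaTime("2018-08-22T19:10:53.094Z")
// => Map(YEAR -> 2018, MONTH -> 08, DAY -> 22, HOUR -> 19)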
Your pattern matches only strings that look exactly like yyyy-mm-ddThh:mm, while the one you are testing against has milliseconds and a Z at the end.
You can append .* at the end of your pattern to cover strings that have additional characters at the end.
In addition, let me show you a more idiomatic way of writing your code:
// Create a type for the data instead of using a map.
case class Timestamp(year: Int, month: Int, day: Int, hour: Int, minutes: Int)
// Use triple quotes to avoid extra escaping.
// Don't capture parts that you will not use.
// Add .* at the end to account for milliseconds and timezone.
val reg = """(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}).*""".r
// Instead of empty strings, use Option to represent a value that can be missing.
// Convert to Int after parsing.
def dataExtraction(str: String): Option[Timestamp] = str match {
  case reg(y, m, d, h, min) => Some(Timestamp(y.toInt, m.toInt, d.toInt, h.toInt, min.toInt))
  case _ => None
}
// It works!
dataExtraction("2018-08-22T19:10:53.094Z") // => Some(Timestamp(2018,8,22,19,10))

spark scala pattern matching on a dataframe column

I am coming from an R background. I was able to implement a pattern search on a DataFrame column in R, but now I am struggling to do it in Spark Scala. Any help would be appreciated.
The problem statement is broken down into details below just to describe it appropriately.
DF:
Case                               Freq
135322                              265
183201,135322                        36
135322,135322                        18
135322,121200                        11
121200,135322                         8
112107,112107                         7
183201,135322,135322                  4
112107,135322,183201,121200,80000     2
I am looking for a pattern-search UDF that gives me back all the matches of the pattern in the first column, Case, and the corresponding Freq value from the second column.
Example: for pattern 135322, I would like to find all the matches in the Case column and get back the corresponding Freq numbers, like 265, 36, 18, 11, 8, 4, 2.
For pattern 112107,112107 it should return just 7 because there is one matching row.
This is how the end result should look:
Case                               Freq  results
135322                              265  265+36+18+11+8+4+2
183201,135322                        36  36+4+2
135322,135322                        18  18+4
135322,121200                        11  11+2
121200,135322                         8  8+2
112107,112107                         7  7
183201,135322,135322                  4  4
112107,135322,183201,121200,80000     2  2
What I tried so far:
val text = DF.select("case").collect().map(_.getString(0)).mkString("|")
// search function for pattern search
val valsum = udf((txt: String, pattern: String) => {
  txt.split("\\|").count(_.contains(pattern))
})
// apply the UDF on the first col
val dfValSum = DF.withColumn("results", valsum(lit(text), DF("case")))
This one works
import common.Spark.sparkSession
import java.util.regex.Pattern
import util.control.Breaks._

object playground extends App {
  import org.apache.spark.sql.functions._

  val pattern = "135322,121200" // Pattern you want to search for

  // udf declaration: returns true when the pattern's tokens occur in the
  // Case column in the same order (an ordered subsequence match)
  val coder: ((String, String) => Boolean) = (caseCol: String, pattern: String) => {
    var result = true
    val splitPattern = pattern.split(",")
    val splitCaseCol = caseCol.split(",")
    var foundAtIndex = -1
    for (i <- 0 to splitPattern.length - 1) {
      breakable {
        for (j <- 0 to splitCaseCol.length - 1) {
          if (j > foundAtIndex) {
            // println(splitCaseCol(j)) // debug output
            if (splitCaseCol(j) == splitPattern(i)) {
              result = true
              foundAtIndex = j
              break
            } else result = false
          } else result = false
        }
      }
    }
    // println(caseCol, result) // debug output
    result
  }

  // registering the udf
  val udfFilter = udf(coder)

  // reading the input file
  val df = sparkSession.read.option("delimiter", "\t").option("header", "true").csv("output.txt")

  // calling the function and aggregating
  df.filter(udfFilter(col("Case"), lit(pattern))).agg(lit(pattern), sum("Freq")).toDF("pattern", "sum").show
}
If the input is
135322,121200
Output is
+-------------+----+
| pattern| sum|
+-------------+----+
|135322,121200|13.0|
+-------------+----+
If the input is
135322,135322
Output is
+-------------+----+
| pattern| sum|
+-------------+----+
|135322,135322|22.0|
+-------------+----+
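As an aside, the same ordered-subsequence check can be written more compactly without Breaks. A sketch under the same assumptions (comma-separated tokens that must appear in order, udf imported from org.apache.spark.sql.functions); the helper name is hypothetical:
// A shared iterator makes the order requirement implicit: each exists()
// consumes Case tokens up to and including the first match, so later
// pattern tokens can only match further to the right.
def isOrderedSubsequence(caseCol: String, pattern: String): Boolean = {
  val caseTokens = caseCol.split(",").iterator
  pattern.split(",").forall(p => caseTokens.exists(_ == p))
}

val udfFilter = udf(isOrderedSubsequence _)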

How to convert a list of lists to a simple list by removing duplicate values using Scala?

I have the following list:
List(List(
List(((groupName,group1),(tagMember,["192.168.20.30","192.168.20.20","192.168.20.21"]))),
List(((groupName,group1),(tagMember,["192.168.20.30"]))),
List(((groupName,group1),(tagMember,["192.168.20.30","192.168.20.20"])))))
I want to convert it to:
List((groupName, group1),(tagMember,["192.168.20.30","192.168.20.20","192.168.20.21"]))
I tried to use .flatten but was unable to produce the desired output. How do I get the above-mentioned output using Scala?
I had to make some changes to your input to make it valid.
Input List:
val ll = List(List(
List((("groupName","group1"),("tagMember", List("192.168.20.30","192.168.20.20","192.168.20.21")))),
List((("groupName","group1"),("tagMember",List("192.168.20.30")))),
List((("groupName","group1"),("tagMember",List("192.168.20.30","192.168.20.20"))))
))
The code below works if the groupName and tagMember are the same across all the elements in the list:
def getUniqueIpsConstantGroupTagMember(inputList: List[List[List[((String, String), (String, List[String]))]]]) = {
  // List[((String, String), (String, List[String]))]
  val flattenedList = inputList.flatten.flatten  // use the parameter, not the outer ll
  if (flattenedList.nonEmpty) {
    val group = flattenedList(0)._1
    val tagMember = flattenedList(0)._2._1
    val ips = flattenedList flatMap (_._2._2)
    (group, (tagMember, ips.distinct))
  }
  else List()
}
println(getUniqueIpsConstantGroupTagMember(ll))
Output:
((groupName,group1),(tagMember,List(192.168.20.30, 192.168.20.20, 192.168.20.21)))
Now, let's assume you could have different groupNames.
Sample input:
val listWithVariableGroups = List(List(
List((("groupName","group1"),("tagMember",List("192.168.20.30","192.168.20.20","192.168.20.21")))),
List((("groupName","group1"),("tagMember",List("192.168.20.30")))),
List((("groupName","group1"),("tagMember",List("192.168.20.30","192.168.20.20")))),
List((("groupName","group2"),("tagMember",List("192.168.20.30","192.168.20.10"))))
))
The following code should work.
def getUniqueIpsForMultipleGroups(inputList: List[List[List[((String, String), (String, List[String]))]]]) = {
  val flattenedList = inputList.flatten.flatten
  // Map[(String, String), List[(String, List[String])]]
  val groupedByGroupNameId = flattenedList.groupBy(p => p._1) map {
    case (key, value) => (key, ("tagMember", extractUniqueTagIps(value)))
  }
  groupedByGroupNameId
}

def extractUniqueTagIps(list: List[((String, String), (String, List[String]))]) = {
  val ips = list flatMap (_._2._2)
  ips.distinct
}
getUniqueIpsForMultipleGroups(listWithVariableGroups).foreach(println)
Output:
((groupName,group1),(tagMember,List(192.168.20.30, 192.168.20.20, 192.168.20.21)))
((groupName,group2),(tagMember,List(192.168.20.30, 192.168.20.10)))
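Essentially the same grouping can also be written as a single expression; a sketch against the same nested-list shape:
// Flatten twice, group by the (groupName, groupId) pair, and deduplicate
// the IPs per group; equivalent to the two functions above.
val result = listWithVariableGroups.flatten.flatten
  .groupBy(_._1)
  .map { case (group, entries) =>
    (group, ("tagMember", entries.flatMap(_._2._2).distinct))
  }
result.foreach(println)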