How to remove double quotes from keys in RDD and split JSON into two lines? - regex

I need to modify the data to feed it into a CEP system; each element of my current RDD is a JSON string that looks like this:
{"var":"system-ready","value":0.0,"objectID":"2018","partnumber":2,"t":"2017-08-25 11:27:39.000"}
I need output like
t = "2017-08-25 11:27:39.000
Check = { var = "system-ready",value = 0.0, objectID = "2018", partnumber = 2 }
I have been writing RDD map operations to achieve this; better options are welcome. colCount is the number of columns.
rdd.map(x => x.split("\":").mkString("\" ="))
  .map(f => (f.dropRight(1).split(",").last.toString, f.drop(1).split(",").toSeq.take(colCount - 1).toString))
  .map(f => (f._1, f._2.replace("WrappedArray(", "Check = {")))
  .map(f => (f._1.replace("\"t\"", "t"), f._2.dropRight(1).replace("(", "{")))
  .map(f => f.toString().split(",C").mkString("\nC").replace(")", "}").replace("(", "")) // replace "," with "\n", drop "("
  .map(f => f.replace("\" =\"", "=\"").replace("\", \"", "\",").replace("\" =", "=").replace(", \"", ",").replace("{\"", "{"))

Scala's built-in JSON parser seems like a good fit for this problem:
import scala.util.parsing.json.JSON
rdd.map { x =>
  // parseFull returns Option[Any]; numeric fields come back as Double
  JSON.parseFull(x).get.asInstanceOf[Map[String, Any]]
}
This will result in an RDD[Map[String, Any]]. You can then access the t field of each record, for example, using:
.map(dict => "t = " + dict("t"))

Related

How to save string as json in scala spark

I have raw strings in a log file. I do a lot of filtering and other operations, after which I have reached the following problem: I need to convert a string into JSON format so that I can save it as a single object.
Suppose I have the following data:
val CDataTime = "20191012"
val LocationId = "12345"
val SetInstruc = "Comm=Qwe123,Elem=12345,Elem123=Test"
I am trying to create a data frame with the columns datetime|location|jsonofinstruction, where jsonofinstruction is the JSON of the third val. I tried splitting the string first by comma, then by the equals sign, looping through in steps of two and building a map of key/value pairs, but the JSON is not created. Please help.
You can use scala.util.parsing.json.JSONObject to convert a map to JSON and then to a string.
import scala.util.parsing.json.JSONObject
import spark.implicits._

val df = spark.createDataset(Seq("Comm=Qwe123,Elem=12345,Elem123=Test")).toDF("col3")
val dfWithJson = df.map { row =>
  val insMap = row.getAs[String]("col3").split(",").map { kv =>
    val kvArray = kv.split("=")
    (kvArray(0), kvArray(1))
  }.toMap
  val insJson = JSONObject(insMap).toString()
  (row.getAs[String]("col3"), insJson)
}.toDF("col3", "col4")
dfWithJson.show()
Result -
+--------------------+--------------------+
| col3| col4|
+--------------------+--------------------+
|Comm=Qwe123,Elem=...|{"Comm" : "Qwe123...|
+--------------------+--------------------+
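If you also need the datetime and location columns from the question, here is a hedged sketch along the same lines (assuming the three values are the literals shown above and that a SparkSession named spark is in scope):
import scala.util.parsing.json.JSONObject
import spark.implicits._

val CDataTime = "20191012"
val LocationId = "12345"
val SetInstruc = "Comm=Qwe123,Elem=12345,Elem123=Test"

// build the key/value map once and render it as a JSON string
val insJson = JSONObject(SetInstruc.split(",").map { kv =>
  val Array(k, v) = kv.split("=", 2)
  (k, v)
}.toMap).toString()

val df = Seq((CDataTime, LocationId, insJson)).toDF("datetime", "location", "jsonofinstruction")
df.show(false)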

Keep only numbers in a string / remove all non-numbers

I have a string column where I only need the numbers from each string, e.g.
A-123 -> 123
456 -> 456
7-X89 -> 789
How can this be done in PowerQuery?
Add a custom column. In the custom column formula, type this one-liner:
= Text.Select( [Column], {"0".."9"} )
where [Column] is a string column with a mix of digits and other characters. It extracts numbers only. The new column is still a text column, so you have to change the type.
Edit: if there are also dot and minus characters to keep:
= Text.Select( [Column1], {"0".."9", "-", "."} )
Alternatively, you can transform the existing column:
= Table.TransformColumns( #"PreviousStepName" , {{"Column", each Text.Select( _ , {"0".."9","-","."} ) }} )
An alternative solution is to split the value on each digit and remove the blanks from the resulting list.
That result can then be used as a list of delimiters for Splitter.SplitTextByEachDelimiter, which splits the original text again; combining the resulting list gives the final result.
Explanation: Splitter.SplitTextByEachDelimiter first splits on the first delimiter in the list, then on the second, and so on. Note that this function returns a function, which must then be called with the original string as its parameter, i.e. Splitter.SplitTextByEachDelimiter(delimiters)(string).
Example code:
let
    Source = Table1,
    NumbersOnly = Table.TransformColumns(Source, {{"String", (string) => Text.Combine(Splitter.SplitTextByEachDelimiter(List.Select(Text.SplitAny(string, "0123456789"), each _ <> ""))(string))}})
in
    NumbersOnly
First, create a custom function in PowerQuery using New Query - From Other Sources -> Blank Query. Open the Advanced Editor and paste the following code:
(source) =>
let
    // keep a character only if its code point is between "0" (48) and "9" (57)
    NumbersOnly = (char) => if Character.ToNumber(char) >= 48 and Character.ToNumber(char) < 58 then char else "",
    Len = Text.Length(source),
    // walk the string character by character, accumulating only the digits
    Acc = List.Accumulate(
        List.Generate( () => 0, each _ < Len, each _ + 1),
        "",
        (acc, index) => acc & NumbersOnly(Text.At(source, index))
    ),
    AsNumber = Number.FromText(Acc)
in
    AsNumber
Name this query NumbersOnly.
Now in your main query, add another calculated column where you call this NumbersOnly function with the source column, e.g.:
let
    Source = Table.FromRecords({[text="A-123"], [text="456"], [text="7-X89"]}),
    Result = Table.AddColumn(Source, "Values", each NumbersOnly([text]), Int64.Type)
in
    Result

Saving partial spark DStream window to HDFS

I am counting the values in each window to find the top values, and I want to save only the 10 most frequent values of each window to HDFS rather than all of them.
eegStreams(a) = KafkaUtils.createStream(ssc, zkQuorum, group, Map(args(a) -> 1),StorageLevel.MEMORY_AND_DISK_SER).map(_._2)
val counts = eegStreams(a).map(x => (math.round(x.toDouble), 1)).reduceByKeyAndWindow(_ + _, _ - _, Seconds(4), Seconds(4))
val sortedCounts = counts.map(_.swap).transform(rdd => rdd.sortByKey(false)).map(_.swap)
//sortedCounts.foreachRDD(rdd =>println("\nTop 10 amplitudes:\n" + rdd.take(10).mkString("\n")))
sortedCounts.map(tuple => "%s,%s".format(tuple._1, tuple._2)).saveAsTextFiles("hdfs://ec2-23-21-113-136.compute-1.amazonaws.com:9000/user/hduser/output/" + (a+1))
I can print the top 10 as above (the commented-out line).
I have also tried
sortedCounts.foreachRDD{ rdd => ssc.sparkContext.parallelize(rdd.take(10)).saveAsTextFile("hdfs://ec2-23-21-113-136.compute-1.amazonaws.com:9000/user/hduser/output/" + (a+1))}
but I get the following error; apparently something in the closure is not serializable:
15/01/05 17:12:23 ERROR actor.OneForOneStrategy: org.apache.spark.streaming.StreamingContext
java.io.NotSerializableException: org.apache.spark.streaming.StreamingContext
Can you try this?
sortedCounts.foreachRDD(rdd => rdd.filterWith(ind => ind)((v, ind) => ind <= 10).saveAsTextFile(...))
Note: I didn't test the snippet...
Your first version should work. Just declare the streaming context as @transient val ssc = ... where it is first created.
The second version won't work because the StreamingContext cannot be serialized into the closure.
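An untested sketch of another way around the serialization problem is to take the SparkContext from the RDD inside the closure instead of capturing ssc (note that every window would then write to the same directory, so in practice the path would need a timestamp or counter):
sortedCounts.foreachRDD { rdd =>
  // rdd.take(10) brings the 10 highest-frequency pairs back to the driver
  val top10 = rdd.take(10)
  // rdd.sparkContext avoids closing over the non-serializable StreamingContext
  rdd.sparkContext
    .parallelize(top10)
    .saveAsTextFile("hdfs://ec2-23-21-113-136.compute-1.amazonaws.com:9000/user/hduser/output/" + (a + 1))
}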

Scala: List[Tuple3] to Map[String,String]

I've got a query result of List[(Int, String, Double)] that I need to convert to a Map[String, String] (for display in an HTML select list).
My hacked solution is:
val prices = (dao.getPricing flatMap {
  case (id, label, fee) =>
    Map(id.toString -> (label + " $" + fee))
}).toMap
there must be a better way to achieve the same...
How about this?
val prices: Map[String, String] =
  dao.getPricing.map {
    case (id, label, fee) => (id.toString -> (label + " $" + fee))
  }(collection.breakOut)
The method collection.breakOut provides a CanBuildFrom instance that ensures that, even though you're mapping over a List, a Map is built directly (thanks to the type annotation), avoiding the creation of an intermediate collection.
A little more concise:
val prices =
  dao.getPricing.map { case (id, label, fee) => (id.toString, label + " $" + fee) }.toMap
A shorter alternative:
val prices =
  dao.getPricing.map { p => (p._1.toString, p._2 + " $" + p._3) }.toMap
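For illustration, with some made-up pricing rows standing in for dao.getPricing, all three versions produce the same map:
// hypothetical data, just to show the shape of the result
val pricing: List[(Int, String, Double)] = List((1, "Basic", 10.0), (2, "Premium", 25.5))

val prices: Map[String, String] =
  pricing.map { case (id, label, fee) => (id.toString, label + " $" + fee) }.toMap
// prices == Map("1" -> "Basic $10.0", "2" -> "Premium $25.5")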

SubSonic3 Update Question

Is it possible to do something like this in SubSonic3?
_db.Update<Product>()
.Set("UnitPrice")
.EqualTo(UnitPrice + 5)
.Where<Product>(x=>x.ProductID==5)
.Execute();
I would need something like this:
UPDATE Blocks
SET OrderId = OrderId - 1
WHERE ComponentId = 3
but in SubSonic3.
I think you can. Here is a sample demonstrating how you can do it with SubSonic 3:
// One thing you might not have seen with Linq To Sql is the ability to run Updates and Inserts,
// which I've always missed and have now implemented with SubSonic 3.0:
db.Update<Products>().Set(
x => x.Discontinued == false,
x => x.ReorderLevel == 100)
.Where(x=>x.Category==5)
.Execute();
db.Insert.Into<Region>(x => x.RegionID, x => x.RegionDescription)
.Values(6, "Hawaii")
.Execute();
and here is a link to the full demonstration
I do it as a select:
var model = ClassName.SingleOrDefault(x => x.id == 1);
model.name = "new name";
model.tel = "new telephone";
model.save();
Done.