Regex / substring to extract all matching patterns / groups

I get this as a response to an API hit.
1735 Queries
Taking 1.001303 to 31.856310 seconds to complete
SET timestamp=XXX;
SELECT * FROM ABC_EM WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
38 Queries
Taking 1.007646 to 5.284330 seconds to complete
SET timestamp=XXX;
show slave status;
6 Queries
Taking 1.021271 to 1.959838 seconds to complete
SET timestamp=XXX;
SHOW SLAVE STATUS;
2 Queries
Taking 4.825584, 18.947725 seconds to complete
use marketing;
SET timestamp=XXX;
SELECT * FROM ABC WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
I have extracted this out of the response HTML and have it as a string now. I need to retrieve the values as concisely as possible, so that I get a map of the form Map(query -> "T1 to T2 seconds"). Basically this is the status of all the slow queries running on a MySQL slave server, and I am building an alert system over it. So from this entire paragraph, which I have as a String, I need to separate out the queries and save the corresponding time range with each of them.
1.001303 to 31.856310 is a time range, and the corresponding query for that range is:
SET timestamp=XXX; SELECT * FROM ABC_EM WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
I was hoping to save this information in a Map in Scala, of the form (query: String -> timeRange: String).
Another example:
("use marketing; SET timestamp=XXX; SELECT * FROM ABC WHERE last_modified >= 'XXX' AND last_modified xyz ;"->"4.825584 to 18.947725 seconds")
"""###(.)###(.)\n\n(.*)###""".r.findAllIn(reqSlowQueryData).matchData foreach {m => println("group0"+m.group(1)+"next group"+m.group(2)+m.group(3)}
I am using the above statement to extract the repeating cells so I can do my manipulations on them later, but it doesn't seem to be working.
Thanks in advance! I know there are several ways to do this, but all the ones that come to mind are inefficient and tedious. I need to do this in Scala. Maybe I could extract the pieces recursively using the substring method?

If you want to use Scala, try this:
val regex = """(\d+).(\d+).*(\d+).(\d+) seconds""".r // extract range
val txt = """
|1735 Queries
|
|Taking 1.001303 to 31.856310 seconds to complete
|
|SET timestamp=XXX; SELECT * FROM ABC_EM WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
|
|38 Queries
|
|Taking 1.007646 to 5.284330 seconds to complete
|
|SET timestamp=XXX; show slave status;
|
|6 Queries
|
|Taking 1.021271 to 1.959838 seconds to complete
|
|SET timestamp=XXX; SHOW SLAVE STATUS;
|
|2 Queries
|
|Taking 4.825584, 18.947725 seconds to complete
|
|use marketing; SET timestamp=XXX; SELECT * FROM ABC WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
""".stripMargin
def logToMap(txt: String) = {
  val (_, map) = txt.lines.foldLeft[(Option[String], Map[String, String])]((None, Map.empty)) {
    (acc, el) =>
      val (taking, map) = acc // taking contains range
      taking match {
        case Some(range) if el.trim.nonEmpty => // Some contains range
          (None, map + (el -> range)) // add to map
        case None =>
          regex.findFirstIn(el) match { // extract range
            case Some(range) => (Some(range), map)
            case _ => (None, map)
          }
        case _ => (taking, map) // probably empty line
      }
  }
  map
}
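A hedged usage sketch (not tested here): calling logToMap on the sample txt above should give one entry per query, keyed by the query line, e.g.

// prints lines like:
// SET timestamp=XXX; SELECT * FROM ABC_EM WHERE last_modified >= 'XXX' AND last_modified < 'XXX'; -> 1.001303 to 31.856310 seconds
logToMap(txt).foreach { case (query, range) => println(s"$query -> $range") }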

Modified ajozwik's answer to work for SQL commands spanning multiple lines:
val regex = """(\d+).(\d+).*(\d+).(\d+) seconds""".r // extract range
def logToMap(txt: String) = {
  val (_, map) = txt.lines.foldLeft[(Option[String], Map[String, String])]((None, Map.empty)) {
    (accumulator, element) =>
      val (taking, map) = accumulator
      taking match {
        case Some(range) if element.trim.nonEmpty =>
          if (element.contains("Queries"))
            (None, map)
          else
            (Some(range), map + (range -> (map.getOrElse(range, "") + element)))
        case None =>
          regex.findFirstIn(element) match {
            case Some(range) => (Some(range), map)
            case _ => (None, map)
          }
        case _ => (taking, map)
      }
  }
  println(map)
  map
}
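Note that this variant keys the map the other way round from the answer above: the time range is the key and the value is the concatenated query text. A hedged usage sketch (untested):

// prints lines like:
// 4.825584, 18.947725 seconds -> use marketing; SET timestamp=XXX; SELECT * FROM ABC WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
logToMap(txt).foreach { case (range, query) => println(s"$range -> $query") }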

M & Power Query: How to use the $Skip ODATA expression within a loop?

Good afternoon all,
I'm trying to call all of the results within an API that has:
6640 total records
100 records per page
67 pages of results (total records / records per page)
This is an ever growing list so I've used variables to create the above values.
I can obviously use the $Skip ODATA expression to get any one of the 67 pages by adding the expression to the end of the URL like so, which would skip the first 100 records and therefore return the 2nd page:
https://psa.pulseway.com/api/servicedesk/tickets/?$Skip=100
What I'm trying to do though is to create a custom function that will loop through each of the 67 calls, changing the $Skip value by an increment of 100 each time.
I thought I'd accomplished the goal with the below code:
let
    Token = "Token",
    BaseURL = "https://psa.pulseway.com/api/",
    Path = "servicedesk/tickets/",
    RecordsPerPage = 100,
    CountTickets = Json.Document(Web.Contents(BaseURL, [Headers = [Authorization = "Bearer " & Token], RelativePath = Path & "count"])),
    TotalRecords = CountTickets[TotalRecords],
    GetJson = (Url) =>
        let
            Options = [Headers = [#"Authorization" = "Bearer " & Token]],
            RawData = Web.Contents(Url, Options),
            Json = Json.Document(RawData)
        in
            Json,
    GetPage = (Index) =>
        let
            Skip = "$Skip=" & Text.From(Index * RecordsPerPage),
            URL = BaseURL & Path & "?" & Skip,
            Json = GetJson(URL)
        in
            Json,
    TotalPages = Number.RoundUp(TotalRecords / RecordsPerPage),
    PageIndicies = {0 .. TotalPages - 1},
    Pages = List.Transform(PageIndicies, each GetPage(_))
in
    Pages
I got all happy when it successfully made the 67 API calls and combined the results into a list for me to load into a Power Query table; however, what I'm actually seeing is the first 100 records repeated 67 times.
That tells me that the $Skip value handled by my GetPage custom function isn't changing and is stuck on the first page. To make sure the Skip index was being generated properly, I duplicated the query and changed the code to load the $Skip values so I could see what they are, expecting them all to be $Skip=0; what I see, though, is the correct $Skip values, as below:
Image showing correct Skip values
It seems everything is working as it should be, yet I'm only getting the first page 67 times.
I've made a couple of posts on other community sites about this issue before, but I realise the problem I was (poorly) describing was far too broad to get any meaningful assistance. I think I've now gotten to the point where I understand what my own code is doing and have really zoomed in on the problem; I just don't know how to fix it now that I'm at the final hurdle...
Any help/advice would be massively appreciated. Thank you.
Edit: Updated following @RicardoDiaz's answer.
let
    // Define base parameters
    Filter = "",
    Path = "servicedesk/tickets/",
    URL = "https://psa.pulseway.com/api/",
    Token = "Token",
    Limit = "100",
    // Build the table based on record start and any filters
    GetEntityRaw = (Filter as any, RecordStart as text, Path as text) =>
        let
            Options = [Headers = [#"Authorization" = "Bearer " & Token]],
            URLbase = URL & Path & "?bearer=" & Token & "&start=" & RecordStart & "&limit=" & Text.From(Limit),
            URLentity = if Filter <> null then URLbase & Filter else URLbase,
            Source = Json.Document(Web.Contents(URLentity, Options)),
            Result = Source[Result],
            toTable = Table.FromList(Result, Splitter.SplitByNothing(), null, null, ExtraValues.Error)
        in
            toTable,
    // Recursively call the build table function
    GetEntity = (optional RecordStart as text) as table =>
        let
            result = GetEntityRaw(Filter, RecordStart, Path),
            nextStart = Text.From(Number.From(RecordStart) + Limit),
            nextTable = Table.Combine({result, @GetEntity(nextStart)}),
            check = try nextTable otherwise result
        in
            check,
    resultTable = GetEntity("0")
in
    resultTable
As I couldn't test your code, it's hard to give you a concrete answer.
That said, please review the generic code I use to connect to an API and see if you can find where yours is not working.
EDIT: Changed api_record_limit type to number (removed the quotation marks)
let
    // Define base parameters
    api_url_filter = "",
    api_entity = "servicedesk/tickets/",
    api_url = "https://psa.pulseway.com/api/",
    api_token = "Token",
    api_record_limit = 500,
    // Build the table based on record start and any filters
    fx_api_get_entity_raw = (api_url_filter as any, api_record_start as text, api_entity as text) =>
        let
            api_url_base = api_url & api_entity & "?api_token=" & api_token & "&start=" & api_record_start & "&limit=" & Text.From(api_record_limit),
            api_url_entity = if api_url_filter <> null then api_url_base & api_url_filter else api_url_base,
            Source = Json.Document(Web.Contents(api_url_entity)),
            data = Source[data],
            toTable = Table.FromList(data, Splitter.SplitByNothing(), null, null, ExtraValues.Error)
        in
            toTable,
    // Recursively call the build table function
    fxGetEntity = (optional api_record_start as text) as table =>
        let
            result = fx_api_get_entity_raw(api_url_filter, api_record_start, api_entity),
            nextStart = Text.From(Number.From(api_record_start) + api_record_limit),
            nextTable = Table.Combine({result, @fxGetEntity(nextStart)}),
            check = try nextTable otherwise result
        in
            check,
    resultTable = fxGetEntity("0"),
    expandColumn = Table.ExpandRecordColumn(
        resultTable,
        "Column1",
        Record.FieldNames(resultTable{0}[Column1]),
        List.Transform(Record.FieldNames(resultTable{0}[Column1]), each _)
    )
in
    expandColumn
QUESTION TO OP:
Regarding this line:
Result = Source[Result],
Does the JSON return a field called Result instead of data?

spark scala pattern matching on a dataframe column

I am coming from an R background. I was able to implement the pattern search on a DataFrame column in R, but I am now struggling to do it in Spark Scala. Any help would be appreciated.
The problem statement is broken down into details just to describe it appropriately.
DF :
Case Freq
135322 265
183201,135322 36
135322,135322 18
135322,121200 11
121200,135322 8
112107,112107 7
183201,135322,135322 4
112107,135322,183201,121200,80000 2
I am looking for a pattern-search UDF which gives me back all the matches of the pattern and the corresponding Freq values from the second column.
Example: for pattern 135322, I would like to find all the matches in the first column Case and get back the corresponding Freq numbers from the Freq column, like 265, 36, 18, 11, 8, 4, 2.
For pattern 112107,112107 it should return just 7, because there is only one matching row.
This is how the end result should look
Case Freq results
135322 265 265+36+18+11+8+4+2
183201,135322 36 36+4+2
135322,135322 18 18+4
135322,121200 11 11+2
121200,135322 8 8+2
112107,112107 7 7
183201,135322,135322 4 4
112107,135322,183201,121200,80000 2 2
What I tried so far:
val text= DF.select("case").collect().map(_.getString(0)).mkString("|")
//search function for pattern search
val valsum = udf((txt: String, pattern : String)=> {
txt.split("\\|").count(_.contains(pattern))
} )
//apply the UDF on the first col
val dfValSum = DF.withColumn("results", valsum( lit(text),DF("case")))
This one works
import common.Spark.sparkSession
import java.util.regex.Pattern
import util.control.Breaks._

object playground extends App {

  import org.apache.spark.sql.functions._

  val pattern = "135322,121200" // Pattern you want to search for

  // udf declaration
  val coder: ((String, String) => Boolean) = (caseCol: String, pattern: String) => {
    var result = true
    val splitPattern = pattern.split(",")
    val splitCaseCol = caseCol.split(",")
    var foundAtIndex = -1
    for (i <- 0 to splitPattern.length - 1) {
      breakable {
        for (j <- 0 to splitCaseCol.length - 1) {
          if (j > foundAtIndex) {
            println(splitCaseCol(j))
            if (splitCaseCol(j) == splitPattern(i)) {
              result = true
              foundAtIndex = j
              break
            } else result = false
          } else result = false
        }
      }
    }
    println(caseCol, result)
    result
  }

  // registering the udf
  val udfFilter = udf(coder)

  // reading the input file
  val df = sparkSession.read.option("delimiter", "\t").option("header", "true").csv("output.txt")

  // calling the function and aggregating
  df.filter(udfFilter(col("Case"), lit(pattern))).agg(lit(pattern), sum("Freq")).toDF("pattern", "sum").show
}
If the input is
135322,121200
the output is
+-------------+----+
| pattern| sum|
+-------------+----+
|135322,121200|13.0|
+-------------+----+
If the input is
135322,135322
the output is
+-------------+----+
| pattern| sum|
+-------------+----+
|135322,135322|22.0|
+-------------+----+
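To get the full results column from the question (one summed value per row), one possible extension is to run the same kind of in-order token check against every row and sum the matching Freq values. This is only a hedged, untested sketch; it assumes the same df with string columns "Case" and "Freq" read from the file above, and a table small enough to collect to the driver:

import org.apache.spark.sql.functions._

// collect (Case, Freq) pairs once; acceptable here because the lookup table is small
val pairs = df.select("Case", "Freq").collect()
  .map(r => (r.getString(0), r.get(1).toString.toDouble))

// true if every token of pattern appears in caseCol in order (same idea as the breakable loops above)
def isSubsequence(pattern: String, caseCol: String): Boolean = {
  val patternTokens = pattern.split(",")
  val caseTokens = caseCol.split(",")
  var idx = -1
  patternTokens.forall { token =>
    val next = caseTokens.indexWhere(_ == token, idx + 1)
    if (next == -1) false else { idx = next; true }
  }
}

// for each row's Case value, sum the Freq of every row whose Case matches it
val sumMatches = udf { pattern: String =>
  pairs.collect { case (cs, freq) if isSubsequence(pattern, cs) => freq }.sum
}

val withResults = df.withColumn("results", sumMatches(col("Case")))
withResults.show(false)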

Spark - remove special characters from rows Dataframe with different column types

Assuming I have a DataFrame with many columns, some of type string, others of type int and others of type map.
e.g.
field/columns types: stringType|intType|mapType<string,int>|...
|--------------------------------------------------------------------------
| myString1 |myInt1| myMap1 |...
|--------------------------------------------------------------------------
|"this_is_#string"| 123 |{"str11_in#map":1,"str21_in#map":2, "str31_in#map": 31}|...
|"this_is_#string"| 456 |{"str12_in#map":1,"str22_in#map":2, "str32_in#map": 32}|...
|"this_is_#string"| 789 |{"str13_in#map":1,"str23_in#map":2, "str33_in#map": 33}|...
|--------------------------------------------------------------------------
I want to remove some characters, like '_' and '#', from all columns of String and Map type,
so that the resulting DataFrame/RDD will be:
|------------------------------------------------------------------------
|myString1 |myInt1| myMap1|... |
|------------------------------------------------------------------------
|"thisisstring"| 123 |{"str11inmap":1,"str21inmap":2, "str31inmap": 31}|...
|"thisisstring"| 456 |{"str12inmap":1,"str22inmap":2, "str32inmap": 32}|...
|"thisisstring"| 789 |{"str13inmap":1,"str23inmap":2, "str33inmap": 33}|...
|-------------------------------------------------------------------------
I am not sure whether it's better to convert the DataFrame into an RDD and work with that, or to perform the work on the DataFrame itself.
Also, I am not sure how best to handle the regexp with the different column types (I am using Scala).
And I would like to perform this action for all columns of these two types (string and map), trying to avoid hard-coding the column names, as in:
def cleanRows(mytabledata: DataFrame): RDD[String] = {
  // this will do the work for a specific column (myString1) of type string
  val oneColumn_clean = mytabledata.withColumn("myString1", regexp_replace(col("myString1"), "[_#]", ""))
  ...
  // return type can be RDD or Dataframe...
}
Is there any simple solution to perform this?
Thanks
One option is to define two udfs to handle String type columns and Map type columns separately:
import org.apache.spark.sql.functions.udf
val df = Seq(("this_is#string", 3, Map("str1_in#map" -> 3))).toDF("myString", "myInt", "myMap")
df.show
+--------------+-----+--------------------+
| myString|myInt| myMap|
+--------------+-----+--------------------+
|this_is#string| 3|Map(str1_in#map -...|
+--------------+-----+--------------------+
1) Udf to handle string type columns:
def remove_string: String => String = _.replaceAll("[_#]", "")
def remove_string_udf = udf(remove_string)
2) Udf to handle Map type columns:
def remove_map: Map[String, Int] => Map[String, Int] = _.map{ case (k, v) => k.replaceAll("[_#]", "") -> v }
def remove_map_udf = udf(remove_map)
3) Apply the udfs to the corresponding columns to clean them up:
df.withColumn("myString", remove_string_udf($"myString")).
withColumn("myMap", remove_map_udf($"myMap")).show
+------------+-----+-------------------+
| myString|myInt| myMap|
+------------+-----+-------------------+
|thisisstring| 3|Map(str1inmap -> 3)|
+------------+-----+-------------------+
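To avoid hard-coding column names, as the question asks, one possible extension is to walk the schema and apply whichever udf matches the column type. A hedged, untested sketch that reuses remove_string_udf and remove_map_udf from above (it assumes the map columns are MapType(StringType, IntegerType)):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{IntegerType, MapType, StringType}

def cleanAllColumns(df: DataFrame): DataFrame =
  df.schema.fields.foldLeft(df) { (acc, field) =>
    field.dataType match {
      case StringType =>
        acc.withColumn(field.name, remove_string_udf(acc(field.name)))
      case MapType(StringType, IntegerType, _) =>
        acc.withColumn(field.name, remove_map_udf(acc(field.name)))
      case _ => acc // int and other column types are left untouched
    }
  }

cleanAllColumns(df).show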

Unique list from dynamic range table with possible blanks

I have an Excel table in Sheet1, in which column A contains (with possible blank cells in between):
Name of company
Company 1
Company 2
Company 3
Company 1
Company 4
Company 1
Company 3
I want to extract a unique list of company names into Sheet2, also in column A. I can only do this with the help of a helper column if I don't have any blanks between company names; when I do, I get one extra entry which is a blank.
Also, the examples I've researched were for non-dynamic tables, so they don't work because I don't know the length of my column.
I want in Sheet2 column A:
Name of company
Company 1
Company 2
Company 3
Company 4
I'm looking for whichever solution requires less computational power, Excel or Excel-VBA. The final order in which the names appear in Sheet2 doesn't really matter.
Using a slight modification to Recorder-generated code:
Sub Macro1()
    Sheets("Sheet1").Range("A:A").Copy Sheets("Sheet2").Range("A1")
    Sheets("Sheet2").Range("A:A").RemoveDuplicates Columns:=1, Header:=xlYes
    With Sheets("Sheet2").Sort
        .SortFields.Clear
        .SortFields.Add Key:=Range("A2:A" & Rows.Count) _
            , SortOn:=xlSortOnValues, Order:=xlAscending, DataOption:=xlSortNormal
        .SetRange Range("A2:A" & Rows.Count)
        .Header = xlGuess
        .MatchCase = False
        .Orientation = xlTopToBottom
        .SortMethod = xlPinYin
        .Apply
    End With
End Sub
Sample Sheet1:
Sample Sheet2:
The sort removes the blanks.
EDIT#1:
If the original data in Sheet1 was derived from formulas, then using PasteSpecial will remove unwanted formula copying. There is also a final sweep for empty cells:
Sub Macro1_The_Sequel()
    Dim rng As Range
    Sheets("Sheet1").Range("A:A").Copy
    Sheets("Sheet2").Range("A1").PasteSpecial Paste:=xlPasteValues
    Sheets("Sheet2").Range("A:A").RemoveDuplicates Columns:=1, Header:=xlYes
    Set rng = Sheets("Sheet2").Range("A2:A" & Rows.Count)
    With Sheets("Sheet2").Sort
        .SortFields.Clear
        .SortFields.Add Key:=rng, SortOn:=xlSortOnValues, Order:=xlAscending, DataOption:=xlSortNormal
        .SetRange rng
        .Header = xlGuess
        .MatchCase = False
        .Orientation = xlTopToBottom
        .SortMethod = xlPinYin
        .Apply
    End With
    Call Kleanup
End Sub

Sub Kleanup()
    Dim N As Long, i As Long
    With Sheets("Sheet2")
        N = .Cells(Rows.Count, "A").End(xlUp).Row
        For i = N To 1 Step -1
            If .Cells(i, "A").Value = "" Then
                .Cells(i, "A").Delete shift:=xlUp
            End If
        Next i
    End With
End Sub
All of these answers use VBA. The easiest way to do this is to use a pivot table.
First, select your data, including the header row, and go to Insert -> PivotTable:
Then you will get a dialog box. You don't need to select any of the options here, just click OK. This will create a new sheet with a blank pivot table. You then need to tell Excel what data you're looking for. In this case, you only want the Name of company in the Rows section. On the right-hand side of Excel you will see a new section named PivotTable Fields. In this section, simply click and drag the header to the Rows section:
This will give a result with just the unique names and an entry with (blank) at the bottom:
If you don't want to use the Pivot Table further, simply copy and paste the result rows you're interested in (in this case, the unique company names) into a new column or sheet to get just those without the pivot table attached. If you want to keep the pivot table, you can right click on Grand Total and remove that, as well as filter the list to remove the (blank) entry.
Either way, you now have your list of unique results without blanks and it didn't require any formulas or VBA, and it took relatively few resources to complete (far fewer than any VBA or formula solution).
Here's another method using Excel's built-in Remove Duplicates feature, and a programmed method to remove the blank lines:
EDIT
I have deleted the code using the above methodology as it takes too long to run. I have replaced it with a method that uses VBA's collection object to compile a unique list of companies.
The first method, on my machine, took about two seconds to run; the method below: about 0.02 seconds.
Sub RemoveDups()
    Dim wsSrc As Worksheet, wsDest As Worksheet
    Dim rRes As Range
    Dim I As Long, S As String
    Dim vSrc As Variant, vRes() As Variant, COL As Collection

    Set wsSrc = Worksheets("sheet1")
    Set wsDest = Worksheets("sheet2")
    Set rRes = wsDest.Cells(1, 1)

    'Get the source data
    With wsSrc
        vSrc = .Range(.Cells(1, 1), .Cells(.Rows.Count, 1).End(xlUp))
    End With

    'Collect unique list of companies
    Set COL = New Collection
    On Error Resume Next
    For I = 2 To UBound(vSrc, 1) 'Assume Row 1 is the header
        S = CStr(Trim(vSrc(I, 1)))
        If Len(S) > 0 Then COL.Add S, S
    Next I
    On Error GoTo 0

    'Populate results array
    ReDim vRes(0 To COL.Count, 1 To 1)

    'Header
    vRes(0, 1) = vSrc(1, 1)

    'Companies
    For I = 1 To COL.Count
        vRes(I, 1) = COL(I)
    Next I

    'Set results range
    Set rRes = rRes.Resize(UBound(vRes, 1) + 1)

    'Write the results
    With rRes
        .EntireColumn.Clear
        .Value = vRes
        .EntireColumn.AutoFit
        'Uncomment the below line if you want
        '.Sort key1:=.Columns(1), order1:=xlAscending, MatchCase:=False, Header:=xlYes
    End With
End Sub
NOTE: You wrote you didn't care about the order, but if you want to Sort the results, that added about 0.03 seconds to the routine.
With two sheets named 1 and 2
Inside sheet named: 1
+----+-----------------+
| | A |
+----+-----------------+
| 1 | Name of company |
| 2 | Company 1 |
| 3 | Company 2 |
| 4 | |
| 5 | Company 3 |
| 6 | Company 1 |
| 7 | |
| 8 | Company 4 |
| 9 | Company 1 |
| 10 | Company 3 |
+----+-----------------+
Result in sheet named: 2
+---+-----------------+
| | A |
+---+-----------------+
| 1 | Name of company |
| 2 | Company 1 |
| 3 | Company 2 |
| 4 | Company 3 |
| 5 | Company 4 |
+---+-----------------+
Use this code in a regular module:
Sub extractUni()
    Dim objDic
    Dim Cell
    Dim Area As Range
    Dim i
    Dim Value

    Set Area = Sheets("1").Range("A2:A10") 'this is where your data is located
    Set objDic = CreateObject("Scripting.Dictionary") 'use a Dictionary!

    For Each Cell In Area
        If Not objDic.Exists(Cell.Value) Then
            objDic.Add Cell.Value, Cell.Address
        End If
    Next

    i = 2 '2 because of the heading
    For Each Value In objDic.Keys
        If Not Value = Empty Then
            Sheets("2").Cells(i, 1).Value = Value 'Store the data in column A below the heading
            i = i + 1
        End If
    Next
End Sub
The code returns the data unsorted, just in the order it appears in the source.
If you want a sorted list, just add this code before the last line:
Dim sht As Worksheet
Set sht = Sheets("2")
sht.Activate
With sht.Sort
    .SetRange Range("A:A")
    .Header = xlYes
    .MatchCase = False
    .Orientation = xlTopToBottom
    .SortMethod = xlPinYin
    .Apply
End With
This way the result will always be sorted.
(The subroutine would look like this:)
Sub extractUni()
    Dim objDic
    Dim Cell
    Dim Area As Range
    Dim i
    Dim Value

    Set Area = Sheets("1").Range("A2:A10") 'this is where your data is located
    Set objDic = CreateObject("Scripting.Dictionary") 'use a Dictionary!

    For Each Cell In Area
        If Not objDic.Exists(Cell.Value) Then
            objDic.Add Cell.Value, Cell.Address
        End If
    Next

    i = 2 '2 because of the heading
    For Each Value In objDic.Keys
        If Not Value = Empty Then
            Sheets("2").Cells(i, 1).Value = Value 'Store the data in column A below the heading
            i = i + 1
        End If
    Next

    Dim sht As Worksheet
    Set sht = Sheets("2")
    sht.Activate
    With sht.Sort
        .SetRange Range("A:A")
        .Header = xlYes
        .MatchCase = False
        .Orientation = xlTopToBottom
        .SortMethod = xlPinYin
        .Apply
    End With
End Sub
If you have any questions about the code, I will be glad to explain.

Saving partial spark DStream window to HDFS

I am counting values in each window to find the top values, and I want to save only the 10 most frequent values of each window to HDFS rather than all the values.
eegStreams(a) = KafkaUtils.createStream(ssc, zkQuorum, group, Map(args(a) -> 1),StorageLevel.MEMORY_AND_DISK_SER).map(_._2)
val counts = eegStreams(a).map(x => (math.round(x.toDouble), 1)).reduceByKeyAndWindow(_ + _, _ - _, Seconds(4), Seconds(4))
val sortedCounts = counts.map(_.swap).transform(rdd => rdd.sortByKey(false)).map(_.swap)
//sortedCounts.foreachRDD(rdd =>println("\nTop 10 amplitudes:\n" + rdd.take(10).mkString("\n")))
sortedCounts.map(tuple => "%s,%s".format(tuple._1, tuple._2)).saveAsTextFiles("hdfs://ec2-23-21-113-136.compute-1.amazonaws.com:9000/user/hduser/output/" + (a+1))
I can print the top 10 as above (the commented-out line).
I have also tried:
sortedCounts.foreachRDD{ rdd => ssc.sparkContext.parallelize(rdd.take(10)).saveAsTextFile("hdfs://ec2-23-21-113-136.compute-1.amazonaws.com:9000/user/hduser/output/" + (a+1))}
but I get the following error, complaining that the StreamingContext is not serializable:
15/01/05 17:12:23 ERROR actor.OneForOneStrategy: org.apache.spark.streaming.StreamingContext
java.io.NotSerializableException: org.apache.spark.streaming.StreamingContext
Can you try this?
sortedCounts.foreachRDD(rdd => rdd.filterWith(ind => ind)((v, ind) => ind <= 10).saveAsTextFile(...))
Note: I didn't test the snippet...
Your first version should work. Just declare @transient ssc = ... where the StreamingContext is first created.
The second version won't work because the StreamingContext cannot be serialized inside a closure.
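A hedged sketch of that @transient suggestion, with the rest of the job assumed to be exactly as in the question (sparkConf and the variable a are assumptions, and this is not tested):

import org.apache.spark.streaming.{Seconds, StreamingContext}

@transient val ssc = new StreamingContext(sparkConf, Seconds(4))

// ... build eegStreams(a), counts and sortedCounts exactly as in the question ...

sortedCounts.foreachRDD { rdd =>
  // take(10) brings the top rows back to the driver, so only those ten values are
  // re-parallelized and written; @transient is the suggestion above for keeping the
  // StreamingContext itself out of the serialized closure
  ssc.sparkContext
    .parallelize(rdd.take(10))
    .saveAsTextFile("hdfs://ec2-23-21-113-136.compute-1.amazonaws.com:9000/user/hduser/output/" + (a + 1))
}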