Pattern matching - Spark Scala RDD - regex

I am new to Spark and Scala, coming from an R background. After a few transformations I get an RDD of type
RDD[(String, Int)]
Now I want to apply a regular expression to the String part of the RDD, extract substrings from it, and add just those substrings as a new column.
Input data:
BMW 1er Model,278
MINI Cooper Model,248
Output I am looking for:
Input                 | Brand | Series
BMW 1er Model,278     | BMW   | 1er
MINI Cooper Model,248 | MINI  | Cooper
where Brand and Series are newly computed substrings of the String part of the RDD.
What I have done so far:
I can apply a regular expression to a single String, but I can't apply it to all the lines.
val brandRegEx = """^.*[Bb][Mm][Ww]+|.[Mm][Ii][Nn][Ii]+.*$""".r //to look for BMW or MINI
Then I can use
brandRegEx.findFirstIn("hello this mini is bmW testing")
But how can I apply it to all the lines of the RDD, and apply a different regular expression for the series, to achieve the output above?
I read about this code snippet, but I am not sure how to put it all together:
val brandRegEx = """^.*[Bb][Mm][Ww]+|.[Mm][Ii][Nn][Ii]+.*$""".r
def getBrand(col4: String): String = col4 match {
  case brandRegEx(str) => str
  case _ => ""
}
Any help would be appreciated !
Thanks

To apply your regex to each item in the RDD, you should use the RDD map function, which transforms each row of the RDD using some function; here a partial function is used in order to extract the two parts of the tuple that makes up each row:
import org.apache.spark.{SparkContext, SparkConf}

object Example extends App {

  val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("Example"))

  val data = Seq(
    ("BMW 1er Model", 278),
    ("MINI Cooper Model", 248))

  val dataRDD = sc.parallelize(data)

  val processedRDD = dataRDD.map {
    case (inString, inInt) =>
      // regex to look for BMW or MINI (same as in the question)
      val brandRegEx = """^.*[Bb][Mm][Ww]+|.[Mm][Ii][Nn][Ii]+.*$""".r
      val brand = brandRegEx.findFirstIn(inString).getOrElse("NOT FOUND")
      //val seriesRegEx = ...
      //val series = seriesRegEx.findFirstIn(inString)
      val series = "foo"
      (inString, inInt, brand, series)
  }

  processedRDD.collect().foreach(println)
  sc.stop()
}
Note that I think you have some problems in your regular expression, and you also need a separate regular expression for finding the series. This code outputs:
(BMW 1er Model,278,BMW,foo)
(MINI Cooper Model,248,NOT FOUND,foo)
But once you correct the regexes for your needs, this is how you can apply them to each row.
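As a possible next step (my own sketch, not part of the answer above): one way to get both brand and series in a single pass is a regex with two capturing groups, assuming the brand is always BMW or MINI and the series is the single word that follows it; the exact pattern is an assumption and will need adjusting for real data.
// sketch only: hypothetical pattern, adjust for your actual data
val brandSeriesRegEx = """(?i).*\b(BMW|MINI)\s+(\S+).*""".r

val extractedRDD = dataRDD.map {
  case (inString, inInt) =>
    inString match {
      case brandSeriesRegEx(brand, series) => (inString, inInt, brand, series)
      case _ => (inString, inInt, "NOT FOUND", "NOT FOUND")
    }
}

// For the sample data, extractedRDD.collect() would give:
// (BMW 1er Model,278,BMW,1er)
// (MINI Cooper Model,248,MINI,Cooper)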

Hi, I was just looking at another question and came across this one. The above problem can also be done with ordinary transformations:
val a = sc.parallelize(collection)
a.map { case (x, y) => x.split(" ")(0) + " " + x.split(" ")(1) }.collect
For the sample data this collects Array(BMW 1er, MINI Cooper), i.e. the first two space-separated tokens of each string (the count is dropped).

Related

How to save a string as JSON in Scala Spark

I have raw strings in a log file. I do a lot of filtering and other operations on them, and I have reached the following problem: I need to convert a string into JSON format so that I can save it as a single object.
Suppose I have the following data:
val cDataTime = "20191012"
val locationId = "12345"
val setInstruc = "Comm=Qwe123,Elem=12345,Elem123=Test"
I am trying to create a DataFrame that contains datetime|location|jsonofinstruction, where jsonofinstruction is the JSON of the third value. I tried splitting the string first by comma and then by the equals sign, looping through in steps of two and building a map of keys to values, but the JSON is not created. Please help.
You can use scala.util.parsing.json.JSONObject to convert a map to JSON and then to a string:
import scala.util.parsing.json.JSONObject
import spark.implicits._

val df = spark.createDataset(Seq("Comm=Qwe123,Elem=12345,Elem123=Test")).toDF("col3")

val dfWithJson = df.map { row =>
  val insMap = row.getAs[String]("col3").split(",").map { kv =>
    val kvArray = kv.split("=")
    (kvArray(0), kvArray(1))
  }.toMap
  val insJson = JSONObject(insMap).toString()
  (row.getAs[String]("col3"), insJson)
}.toDF("col3", "col4")

dfWithJson.show()
Result:
+--------------------+--------------------+
| col3| col4|
+--------------------+--------------------+
|Comm=Qwe123,Elem=...|{"Comm" : "Qwe123...|
+--------------------+--------------------+
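The question also asks for a frame with datetime|location|jsonofinstruction. A minimal sketch of that (my addition, reusing the literal values from the question and the same JSONObject approach) could look like this:
// sketch only: cDataTime, locationId and setInstruc are the values from the question
import scala.util.parsing.json.JSONObject
import spark.implicits._

val cDataTime = "20191012"
val locationId = "12345"
val setInstruc = "Comm=Qwe123,Elem=12345,Elem123=Test"

// build the JSON string for the instruction part
val insJson = JSONObject(
  setInstruc.split(",").map { kv =>
    val Array(k, v) = kv.split("=")
    k -> v
  }.toMap
).toString()

val result = Seq((cDataTime, locationId, insJson))
  .toDF("datetime", "location", "jsonofinstruction")
// result.show(false) prints one row with the JSON string in the last column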

Spark - remove special characters from rows of a DataFrame with different column types

Assume I have a DataFrame with many columns, some of type string, others of type int and others of type map, e.g.
field/column types: stringType | intType | mapType<string,int> | ...
|--------------------------------------------------------------------------
| myString1 |myInt1| myMap1 |...
|--------------------------------------------------------------------------
|"this_is_#string"| 123 |{"str11_in#map":1,"str21_in#map":2, "str31_in#map": 31}|...
|"this_is_#string"| 456 |{"str12_in#map":1,"str22_in#map":2, "str32_in#map": 32}|...
|"this_is_#string"| 789 |{"str13_in#map":1,"str23_in#map":2, "str33_in#map": 33}|...
|--------------------------------------------------------------------------
I want to remove some characters like '_' and '#' from all columns of string and map type,
so the resulting DataFrame/RDD will be:
|------------------------------------------------------------------------
|myString1 |myInt1| myMap1|... |
|------------------------------------------------------------------------
|"thisisstring"| 123 |{"str11inmap":1,"str21inmap":2, "str31inmap": 31}|...
|"thisisstring"| 456 |{"str12inmap":1,"str22inmap":2, "str32inmap": 32}|...
|"thisisstring"| 789 |{"str13inmap":1,"str23inmap":2, "str33inmap": 33}|...
|-------------------------------------------------------------------------
I am not sure whether it is better to convert the DataFrame into an RDD and work with that, or to do the work directly on the DataFrame.
Also, I am not sure how best to handle the regexp across the different column types (I am using Scala).
And I would like to perform this action for all columns of these two types (string and map), trying to avoid listing column names explicitly, as in:
def cleanRows(mytabledata: DataFrame): RDD[String] = {
  // this will do the work for a specific column (myString1) of type string
  val oneColumn_clean = mytabledata.withColumn("myString1", regexp_replace(col("myString1"), "[_#]", ""))
  ...
  // return type can be RDD or DataFrame...
}
Is there any simple solution to perform this?
Thanks
One option is to define two UDFs, one to handle string-type columns and one to handle map-type columns:
import org.apache.spark.sql.functions.udf
val df = Seq(("this_is#string", 3, Map("str1_in#map" -> 3))).toDF("myString", "myInt", "myMap")
df.show
+--------------+-----+--------------------+
| myString|myInt| myMap|
+--------------+-----+--------------------+
|this_is#string| 3|Map(str1_in#map -...|
+--------------+-----+--------------------+
1) UDF to handle string-type columns:
def remove_string: String => String = _.replaceAll("[_#]", "")
def remove_string_udf = udf(remove_string)
2) UDF to handle map-type columns:
def remove_map: Map[String, Int] => Map[String, Int] = _.map{ case (k, v) => k.replaceAll("[_#]", "") -> v }
def remove_map_udf = udf(remove_map)
3) Apply the UDFs to the corresponding columns to clean them up:
df.withColumn("myString", remove_string_udf($"myString")).
  withColumn("myMap", remove_map_udf($"myMap")).show
+------------+-----+-------------------+
| myString|myInt| myMap|
+------------+-----+-------------------+
|thisisstring| 3|Map(str1inmap -> 3)|
+------------+-----+-------------------+
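Since the question asks to avoid hard-coding column names, a possible extension (my own sketch, not part of the answer above) is to walk the DataFrame schema and apply the matching UDF by column type:
// sketch only: reuses remove_string_udf / remove_map_udf defined above and
// assumes the map columns are map<string,int> as in the example
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{MapType, StringType}

def cleanAllColumns(df: DataFrame): DataFrame =
  df.schema.fields.foldLeft(df) { (acc, field) =>
    field.dataType match {
      case StringType => acc.withColumn(field.name, remove_string_udf(col(field.name)))
      case _: MapType => acc.withColumn(field.name, remove_map_udf(col(field.name)))
      case _ => acc
    }
  }

// cleanAllColumns(df).show then gives the same cleaned output as above,
// without naming any column explicitly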

Split Pandas Column by values that are in a list

I have three lists that look like this:
age = ['51+', '21-30', '41-50', '31-40', '<21']
cluster = ['notarget', 'cluster3', 'allclusters', 'cluster1', 'cluster2']
device = ['htc_one_2gb','iphone_6/6+_at&t','iphone_6/6+_vzn','iphone_6/6+_all_other_devices','htc_one_2gb_limited_time_offer','nokia_lumia_v3','iphone5s','htc_one_1gb','nokia_lumia_v3_more_everything']
I also have a column in a df that looks like this:
campaign_name
0 notarget_<21_nokia_lumia_v3
1 htc_one_1gb_21-30_notarget
2 41-50_htc_one_2gb_cluster3
3 <21_htc_one_2gb_limited_time_offer_notarget
4 51+_cluster3_iphone_6/6+_all_other_devices
I want to split the column into three separate columns based on the values in the above lists. Like so:
age cluster device
0 <21 notarget nokia_lumia_v3
1 21-30 notarget htc_one_1gb
2 41-50 cluster3 htc_one_2gb
3 <21 notarget htc_one_2gb_limited_time_offer
4 51+ cluster3 iphone_6/6+_all_other_devices
My first thought was to do a simple test like this:
ages_list = []
for i in age:
    if i in df['campaign_name'][0]:
        ages_list.append(i)
print(ages_list)
>>> ['<21']
I was then going to convert ages_list to a Series and combine it with the remaining two to get the end result above, but I assume there is a more pythonic way of doing it?
The idea is to build a regular expression from the values you already have. For example, if you want a regular expression that captures any value from your age list, you can do something like '|'.join(age), and likewise for the values in cluster and device.
The device list is a special case because it contains the + sign, which conflicts with regex syntax (+ means "one or more" in a regex). We can fix this by replacing every + with \+, which means we want to match a literal +.
import re
import pandas as pd

df = pd.DataFrame({'campaign_name': ['notarget_<21_nokia_lumia_v3', 'htc_one_1gb_21-30_notarget', '41-50_htc_one_2gb_cluster3', '<21_htc_one_2gb_limited_time_offer_notarget', '51+_cluster3_iphone_6/6+_all_other_devices']})

def split_df(row):
    campaign_name = row['campaign_name']
    row['age'] = re.findall('|'.join(age), campaign_name)[0]
    row['cluster'] = re.findall('|'.join(cluster), campaign_name)[0]
    row['device'] = re.findall('|'.join([x.replace('+', r'\+') for x in device]), campaign_name)[0]
    return row

df.apply(split_df, axis=1)
If you want to drop the original column you can do this:
df.apply(split_df, axis=1).drop('campaign_name', axis=1)
Here I am assuming that every value is matched by the regex; if that is not guaranteed you can add your own checks, but you get the idea.

Erlang record item list

For example, I have an Erlang record:
-record(state, {clients
}).
Can I make the clients field a list, so that I can keep values in it as in a normal list? And how can I add values to that list?
Thank you.
Maybe you mean something like:
-module(reclist).
-export([empty_state/0, some_state/0,
         add_client/1, del_client/1,
         get_clients/1]).

-record(state,
        {
          clients = [] :: [pos_integer()],
          dbname :: char()
        }).

empty_state() ->
    #state{}.

some_state() ->
    #state{
       clients = [1,2,3],
       dbname = "QA"}.

del_client(Client) ->
    S = some_state(),
    C = S#state.clients,
    S#state{clients = lists:delete(Client, C)}.

add_client(Client) ->
    S = some_state(),
    C = S#state.clients,
    S#state{clients = [Client|C]}.

get_clients(#state{clients = C, dbname = _D}) ->
    C.
Test:
1> reclist:empty_state().
{state,[],undefined}
2> reclist:some_state().
{state,[1,2,3],"QA"}
3> reclist:add_client(4).
{state,[4,1,2,3],"QA"}
4> reclist:del_client(2).
{state,[1,3],"QA"}
::[pos_integer()] means that the type of the field is a list of positive integers (starting from 1); it is a hint for the static analysis tool dialyzer when it performs type checking.
Erlang also allows you use pattern matching on records:
5> reclist:get_clients(reclist:some_state()).
[1,2,3]
Further reading:
Records
Types and Function Specifications
dialyzer(1)
#JUST MY correct OPINION's answer made me remember that I love how Haskell goes about getting the values of the fields in the data type.
Here's a definition of a data type, stolen from Learn You a Haskell for Great Good!, which leverages record syntax:
data Car = Car { company :: String
               , model   :: String
               , year    :: Int
               } deriving (Show)
It creates the functions company, model and year, which look up fields in the data type. We first make a new car:
ghci> Car "Toyota" "Supra" 2005
Car {company = "Toyota", model = "Supra", year = 2005}
Or, using record syntax (the order of fields doesn't matter):
ghci> Car {model = "Supra", year = 2005, company = "Toyota"}
Car {company = "Toyota", model = "Supra", year = 2005}
ghci> let supra = Car {model = "Supra", year = 2005, company = "Toyota"}
ghci> year supra
2005
We can even use pattern matching:
ghci> let (Car {company = c, model = m, year = y}) = supra
ghci> "This " ++ c ++ " " ++ m ++ " was made in " ++ show y
"This Toyota Supra was made in 2005"
I remember there were attempts to implement something similar to Haskell's record syntax in Erlang, but I am not sure whether they were successful.
Some posts, concerning these attempts:
In Response to "What Sucks About Erlang"
Geeking out with Lisp Flavoured Erlang. However I would ignore parameterized modules here.
It seems that LFE uses macros, similar to what Scheme (Racket, for instance) provides when you want to create a new value of some structure:
> (define-struct car (company model year))
> (define supra (make-car "Toyota" "Supra" 2005))
> (car-model supra)
"Supra"
I hope we will have something close to Haskell's record syntax in the future; that would be really useful and handy in practice.
Yasir's answer is the correct one, but I'm going to show you WHY it works the way it works so you can understand records a bit better.
Records in Erlang are a hack (and a pretty ugly one). Using the record definition from Yasir's answer...
-record(state,
        {
          clients = [] :: [pos_integer()],
          dbname :: char()
        }).
...when you instantiate this with #state{} (as Yasir did in empty_state/0 function), what you really get back is this:
{state, [], undefined}
That is to say your "record" is just a tuple tagged with the name of the record (state in this case) followed by the record's contents. Inside BEAM itself there is no record. It's just another tuple with Erlang data types contained within it. This is the key to understanding how things work (and the limitations of records to boot).
Now when Yasir did this...
add_client(Client) ->
    S = some_state(),
    C = S#state.clients,
    S#state{clients = [Client|C]}.
...the S#state.clients bit translates internally into code that looks like element(2, S). In other words, you are using standard tuple-manipulation functions. S#state.clients is just a symbolic way of saying the same thing, but in a way that lets you know what element 2 actually is. It's syntactic saccharine, but an improvement over keeping track of individual fields in your tuples in an error-prone way.
Now for that last S#state{clients = [Client|C]} bit, I'm not absolutely positive as to what code is generated behind the scenes, but it is likely just straightforward stuff that does the equivalent of {state, [Client|C], element(3,S)}. It:
tags a new tuple with the name of the record (provided as #state),
copies the elements from S (dictated by the S# portion),
except for the clients piece overridden by {clients = [Client|C]}.
All of this magic is done via a preprocessing hack behind the scenes.
Understanding how records work behind the scenes is beneficial both for understanding code written using records as well as for understanding how to use them yourself (not to mention understanding why things that seem to "make sense" don't work with records -- because they don't actually exist down in the abstract machine...yet).
If you are only adding or removing single items from the clients list in the state, you could cut down on typing with a macro:
-record(state, {clients = [] }).
-define(AddClientToState(Client, State),
        State#state{clients = lists:append([Client], State#state.clients)}).
-define(RemoveClientFromState(Client, State),
        State#state{clients = lists:delete(Client, State#state.clients)}).
Here is a test escript that demonstrates this:
#!/usr/bin/env escript

-record(state, {clients = [] }).

-define(AddClientToState(Client, State),
        State#state{clients = lists:append([Client], State#state.clients)}).
-define(RemoveClientFromState(Client, State),
        State#state{clients = lists:delete(Client, State#state.clients)}).

main(_) ->
    % Start with a state holding an empty list of clients.
    State0 = #state{},
    io:format("Empty State: ~p~n", [State0]),

    % Add foo to the list.
    State1 = ?AddClientToState(foo, State0),
    io:format("State after adding foo: ~p~n", [State1]),

    % Add bar to the list.
    State2 = ?AddClientToState(bar, State1),
    io:format("State after adding bar: ~p~n", [State2]),

    % Add baz to the list.
    State3 = ?AddClientToState(baz, State2),
    io:format("State after adding baz: ~p~n", [State3]),

    % Remove bar from the list.
    State4 = ?RemoveClientFromState(bar, State3),
    io:format("State after removing bar: ~p~n", [State4]).
Result:
Empty State: {state,[]}
State after adding foo: {state,[foo]}
State after adding bar: {state,[bar,foo]}
State after adding baz: {state,[baz,bar,foo]}
State after removing bar: {state,[baz,foo]}

Scala - how to get a list element

I'm trying to get elements from a list:
val data = List(("2001",13.1),("2009",3.1),("2004",24.0),("2011",1.11))
Any help? The task is to print the Strings and the numbers separately, like:
print(x._1 + " " + x._2)
but this is not working.
One good practice with functional programming is to do as much as possible with side-effect-free transformations of immutable objects.
What that means (in this case) is that you can convert the list of tuples to a list of strings, then limit your side-effect (the println) to a single step at the very end.
val data = List(("2001",13.1),("2009",3.1),("2004",24.0),("2011",1.11))
val lines = data map { case(a,b) => a + " " + b.toString }
println(lines mkString "\n")
scala> val data =List(("2001",13.1),("2009",3.1),("2004",24.0),("2011",1.11))
data: List[(java.lang.String, Double)] = List((2001,13.1), (2009,3.1), (2004,24.0), (2011,1.11))
scala> data.foreach(x => println(x._1+" "+x._2))
2001 13.1
2009 3.1
2004 24.0
2011 1.11
val list = List(("2001",13.1),("2009",3.1),("2004",24.0),("2011",1.11))
println(list map (_.productIterator mkString " ") mkString "\n")
2001 13.1
2009 3.1
2004 24.0
2011 1.11
I would use pattern matching, which yields a programming pattern that scales better for larger tuples and more complex elements:
data.foreach { case (b,c) => println(b + " " + c) }
for the Strings, use data.foreach(((_: (String, Double))._1) andThen print)
for the numbers, use data.foreach(((_: (String, Double))._2) andThen print)