Spark - remove special characters from rows Dataframe with different column types - regex

Assume I have a DataFrame with many columns; some are of type string, others of type int, and others of type map.
e.g.
field/column types: StringType | IntType | MapType<String, Int> | ...
|-------------------|--------|----------------------------------------------------------|...
| myString1         | myInt1 | myMap1                                                   |...
|-------------------|--------|----------------------------------------------------------|...
| "this_is_#string" | 123    | {"str11_in#map":1, "str21_in#map":2, "str31_in#map": 31} |...
| "this_is_#string" | 456    | {"str12_in#map":1, "str22_in#map":2, "str32_in#map": 32} |...
| "this_is_#string" | 789    | {"str13_in#map":1, "str23_in#map":2, "str33_in#map": 33} |...
|-------------------|--------|----------------------------------------------------------|...
I want to remove certain characters, such as '_' and '#', from all columns of String and Map type, so that the resulting DataFrame/RDD will be:
|-------------------|--------|----------------------------------------------------|...
| myString1         | myInt1 | myMap1                                             |...
|-------------------|--------|----------------------------------------------------|...
| "thisisstring"    | 123    | {"str11inmap":1, "str21inmap":2, "str31inmap": 31} |...
| "thisisstring"    | 456    | {"str12inmap":1, "str22inmap":2, "str32inmap": 32} |...
| "thisisstring"    | 789    | {"str13inmap":1, "str23inmap":2, "str33inmap": 33} |...
|-------------------|--------|----------------------------------------------------|...
I am not sure whether it is better to convert the DataFrame into an RDD and work with that, or to do the work on the DataFrame directly.
Also, I am not sure how best to handle the regexp across the different column types (I am using Scala).
I would like to perform this action for all columns of these two types (string and map), avoiding hard-coding the column names as in:
def cleanRows(mytabledata: DataFrame): DataFrame = {
  // this will do the work for one specific column (myString1) of type string
  val oneColumn_clean = mytabledata.withColumn("myString1", regexp_replace(col("myString1"), "[_#]", ""))
  ...
  // the return type can be an RDD or a DataFrame...
}
Is there any simple solution to perform this?
Thanks

One option is to define two udfs to handle String-type columns and Map-type columns separately:
import org.apache.spark.sql.functions.udf
val df = Seq(("this_is#string", 3, Map("str1_in#map" -> 3))).toDF("myString", "myInt", "myMap")
df.show
+--------------+-----+--------------------+
| myString|myInt| myMap|
+--------------+-----+--------------------+
|this_is#string| 3|Map(str1_in#map -...|
+--------------+-----+--------------------+
1) Udf to handle string type columns:
def remove_string: String => String = _.replaceAll("[_#]", "")
def remove_string_udf = udf(remove_string)
2) Udf to handle Map type columns:
def remove_map: Map[String, Int] => Map[String, Int] = _.map{ case (k, v) => k.replaceAll("[_#]", "") -> v }
def remove_map_udf = udf(remove_map)
3) Apply the udfs to the corresponding columns to clean them up:
df.withColumn("myString", remove_string_udf($"myString")).
   withColumn("myMap", remove_map_udf($"myMap")).show
+------------+-----+-------------------+
| myString|myInt| myMap|
+------------+-----+-------------------+
|thisisstring| 3|Map(str1inmap -> 3)|
+------------+-----+-------------------+
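To avoid listing the column names explicitly, as the question asks, one possible approach is to walk df.schema and apply the matching udf by data type. This is only a sketch, reusing the two udfs defined above and assuming every map column is Map[String, Int]:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, MapType, StringType}

// Sketch: fold over the schema and clean each column according to its type.
// Assumes every MapType column is Map[String, Int]; other types are left untouched.
def cleanAllColumns(df: DataFrame): DataFrame =
  df.schema.fields.foldLeft(df) { (acc, field) =>
    field.dataType match {
      case StringType =>
        acc.withColumn(field.name, remove_string_udf(col(field.name)))
      case MapType(StringType, IntegerType, _) =>
        acc.withColumn(field.name, remove_map_udf(col(field.name)))
      case _ => acc
    }
  }

cleanAllColumns(df).show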

Related

How do I find the most frequent element in a list in pyspark?

I have a pyspark dataframe with two columns, ID and Elements. The column "Elements" contains a list in each row. It looks like this:
ID | Elements
_______________________________________
X |[Element5, Element1, Element5]
Y |[Element Unknown, Element Unknown, Element_Z]
I want to form a column with the most frequent element of the column 'Elements'. The output should look like this:
ID | Elements | Output_column
__________________________________________________________________________
X |[Element5, Element1, Element5] | Element5
Y |[Element Unknown, Element Unknown, Element_Z] | Element Unknown
How can I do that using pyspark?
Thanks in advance.
We can use higher-order functions here (available from Spark 2.4+).
First use transform and aggregate to get the count for each distinct value in the array.
Then sort the array of structs in descending order and take the first element.
from pyspark.sql import functions as F

temp = (df.withColumn("Dist", F.array_distinct("Elements"))
          .withColumn("Counts", F.expr("""transform(Dist, x ->
                          aggregate(Elements, 0, (acc, y) -> IF(y = x, acc + 1, acc)))"""))
          .withColumn("Map", F.arrays_zip("Dist", "Counts"))
       ).drop("Dist", "Counts")

out = temp.withColumn("Output_column",
          F.expr("""element_at(array_sort(Map, (first, second) ->
                    CASE WHEN first['Counts'] > second['Counts'] THEN -1 ELSE 1 END), 1)['Dist']"""))
Output:
Note that I have added a row with an empty array for ID Z as a test. You can also drop the column Map by adding .drop("Map") to the output.
out.show(truncate=False)
+---+---------------------------------------------+--------------------------------------+---------------+
|ID |Elements |Map |Output_column |
+---+---------------------------------------------+--------------------------------------+---------------+
|X |[Element5, Element1, Element5] |[{Element5, 2}, {Element1, 1}] |Element5 |
|Y |[Element Unknown, Element Unknown, Element_Z]|[{Element Unknown, 2}, {Element_Z, 1}]|Element Unknown|
|Z |[] |[] |null |
+---+---------------------------------------------+--------------------------------------+---------------+
For lower Spark versions, you can use a udf with statistics.mode:
from pyspark.sql import functions as F, types as T
from statistics import mode

u = F.udf(lambda x: mode(x) if len(x) > 0 else None, T.StringType())
df.withColumn("Output", u("Elements")).show(truncate=False)
+---+---------------------------------------------+---------------+
|ID |Elements |Output |
+---+---------------------------------------------+---------------+
|X |[Element5, Element1, Element5] |Element5 |
|Y |[Element Unknown, Element Unknown, Element_Z]|Element Unknown|
|Z |[] |null |
+---+---------------------------------------------+---------------+
You can use PySpark SQL functions to achieve this (Spark 3.1+, where pyspark.sql.functions.transform and aggregate accept Python lambdas).
Here is a generic function that adds a new column containing the most common element of another array column:
import pyspark.sql.functions as sf

def add_most_common_val_in_array(df, arraycol, drop=False):
    """Takes a spark df column of ArrayType() and returns the most common element
    in the array in a new column of the df called f"MostCommon_{arraycol}"

    Args:
        df (spark.DataFrame): dataframe
        arraycol (ArrayType()): array column in which you want to find the most common element
        drop (bool, optional): Drop the arraycol after finding most common element. Defaults to False.

    Returns:
        spark.DataFrame: df with additional column containing most common element in arraycol
    """
    dvals = f"distinct_{arraycol}"
    dvalscount = f"distinct_{arraycol}_count"
    startcols = df.columns
    df = df.withColumn(dvals, sf.array_distinct(arraycol))
    df = df.withColumn(
        dvalscount,
        sf.transform(
            dvals,
            lambda uval: sf.aggregate(
                arraycol,
                sf.lit(0),
                lambda acc, entry: sf.when(entry == uval, acc + 1).otherwise(acc),
            ),
        ),
    )
    countercol = f"ReverseCounter{arraycol}"
    df = df.withColumn(countercol, sf.map_from_arrays(dvalscount, dvals))
    mccol = f"MostCommon_{arraycol}"
    df = df.withColumn(mccol, sf.element_at(countercol, sf.array_max(dvalscount)))
    df = df.select(*startcols, mccol)
    if drop:
        df = df.drop(arraycol)
    return df

How to save string as json in scala spark

I have raw strings from a log file, and I apply many filters and other operations to them. I have now reached the following problem: I need to convert a string into JSON format so that I can save it as a single object.
Suppose I have the following data:
val CDataTime = "20191012"
val LocationId = "12345"
val SetInstruc = "Comm=Qwe123,Elem=12345,Elem123=Test"
I am trying to create a DataFrame that contains datetime | location | jsonofinstruction.
The jsonofinstruction is the JSON of the third val. I tried to split the string first by comma and then by the equals sign, loop through the pairs, and build a map with the first part as the key and the second as the value, but the JSON is not created. Please help.
You can use scala.util.parsing.json.JSONObject to convert a map to JSON and then to a string.
import scala.util.parsing.json.JSONObject
import spark.implicits._

val df = spark.createDataset(Seq("Comm=Qwe123,Elem=12345,Elem123=Test")).toDF("col3")

val dfWithJson = df.map { row =>
  val insMap = row.getAs[String]("col3").split(",").map { kv =>
    val kvArray = kv.split("=")
    (kvArray(0), kvArray(1))
  }.toMap
  val insJson = JSONObject(insMap).toString()
  (row.getAs[String]("col3"), insJson)
}.toDF("col3", "col4")

dfWithJson.show()
Result -
+--------------------+--------------------+
| col3| col4|
+--------------------+--------------------+
|Comm=Qwe123,Elem=...|{"Comm" : "Qwe123...|
+--------------------+--------------------+
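If the three values in the question are plain Scala variables rather than columns, a minimal sketch of building the single-row datetime | location | json DataFrame could look like the following; the column names are illustrative, and it assumes a SparkSession named spark is in scope (e.g. in spark-shell):
import scala.util.parsing.json.JSONObject
import spark.implicits._

val cDataTime  = "20191012"
val locationId = "12345"
val setInstruc = "Comm=Qwe123,Elem=12345,Elem123=Test"

// Turn "k=v,k=v,..." into a Map and render it as a JSON string.
val insJson = JSONObject(
  setInstruc.split(",").map { kv =>
    val Array(k, v) = kv.split("=", 2)
    k -> v
  }.toMap
).toString()

val resultDf = Seq((cDataTime, locationId, insJson))
  .toDF("datetime", "location", "instruction_json")
resultDf.show(false)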

Use regular expression to extract elements from a pandas data frame

From the following data frame:
import pandas as pd

d = {'col1': ['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
My ultimate goal is to extract the letters a, b, or c (as strings) into a pandas Series. For that I am using the .str.findall() method (which uses the re module), as shown below:
# import the module
import re
# define the patterns
pat = 'a|b|c'
# extract the patterns from the elements in the specified column
df['col1'].str.findall(pat)
The problem is that the output, i.e. the letter a, b, or c in each row, comes wrapped in a (single-element) list, as shown below:
Out[301]:
0 [a]
1 [b]
2 [c]
3 [a]
While I would like to have the letters a, b or c as string, as shown below:
0 a
1 b
2 c
3 a
I know that if I combine re.search() with .group() I can get a string, but if I do:
df['col1'].str.search(pat).group()
I will get the following error message:
AttributeError: 'StringMethods' object has no attribute 'search'
Using .str.split() won't do the job because, in my original dataframe, I want to capture strings that might contain the delimiter (e.g. I might want to capture a-b)
Does anyone know a simple solution for that, perhaps avoiding iterative operations such as a for loop or list comprehension?
Use extract with a capturing group:
import pandas as pd
d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
result = df['col1'].str.extract('(a|b|c)')
print(result)
Output
0
0 a
1 b
2 c
3 a
Fix your code
pat = 'a|b|c'
df['col1'].str.findall(pat).str[0]
Out[309]:
0 a
1 b
2 c
3 a
Name: col1, dtype: object
Simply try str.split() like this: df["col1"].str.split("-", n=1, expand=True)
import pandas as pd
d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
df['col1'] = df["col1"].str.split("-", n = 1, expand = True)
print(df.head())
Output:
col1
0 a
1 b
2 c
3 a

Keep only numbers in a string / remove all non-numbers

I have a string column where I only need the numbers from each string, e.g.
A-123 -> 123
456 -> 456
7-X89 -> 789
How can this be done in PowerQuery?
Add a column. In the custom column formula, type this one-liner:
= Text.Select( [Column], {"0".."9"} )
where [Column] is a string column with a mix of digits and other characters. It extracts numbers only. The new column is still a text column, so you have to change the type.
Edit: if there are also dots and minus characters:
= Text.Select( [Column], {"0".."9", "-", "."} )
Alternatively, you can transform the existing column:
= Table.TransformColumns( #"PreviousStepName" , {{"Column", each Text.Select( _ , {"0".."9","-","."} ) }} )
An alternative solution is to split the value on each digit and remove the blanks from the resulting list.
The resulting non-digit fragments can then be used as a list of delimiters for Splitter.SplitTextByEachDelimiter, which splits the original text again; combining the resulting list gives the final result.
Explanation: Splitter.SplitTextByEachDelimiter first splits on the first delimiter in the list, then on the second, and so on. Note that this function returns a function that must be called with the original string as its parameter, i.e. Splitter.SplitTextByEachDelimiter(delimiters)(string).
Example code:
let
    Source = Table1,
    NumbersOnly = Table.TransformColumns(Source, {{"String", (string) =>
        Text.Combine(
            Splitter.SplitTextByEachDelimiter(
                List.Select(Text.SplitAny(string, "0123456789"), each _ <> "")
            )(string)
        )
    }})
in
    NumbersOnly
First, create a custom function in Power Query via New Query -> From Other Sources -> Blank Query. Open the Advanced Editor and paste the following code:
(source) =>
let
    NumbersOnly = (char) => if Character.ToNumber(char) >= 48 and Character.ToNumber(char) < 58 then char else "",
    Len = Text.Length(source),
    Acc = List.Accumulate(
        List.Generate(() => 0, each _ < Len, each _ + 1),
        "",
        (acc, index) => acc & NumbersOnly(Text.At(source, index))
    ),
    AsNumber = Number.FromText(Acc)
in
    AsNumber
Name this query NumbersOnly.
Now in your main query, add another calculated column where you call this NumbersOnly function with the source column, e.g.:
let
    Source = Table.FromRecords({[text="A-123"], [text="456"], [text="7-X89"]}),
    Result = Table.AddColumn(Source, "Values", each NumbersOnly([text]), Int64.Type)
in
    Result

Pattern matching - spark scala RDD

I am new to Spark and Scala, coming from an R background. After a few transformations of an RDD, I get an RDD of type
Description: RDD[(String, Int)]
Now I want to apply a regular expression to the String part of the RDD, extract substrings from it, and add just those substrings as new columns.
Input Data :
BMW 1er Model,278
MINI Cooper Model,248
Output I am looking for:
Input                  | Brand | Series
BMW 1er Model, 278     | BMW   | 1er
MINI Cooper Model, 248 | MINI  | Cooper
where Brand and Series are substrings newly extracted from the String part of each tuple.
What I have done so far: I could achieve this for a single String using a regular expression, but I cannot apply it to all the lines.
val brandRegEx = """^.*[Bb][Mm][Ww]+|.[Mm][Ii][Nn][Ii]+.*$""".r //to look for BMW or MINI
Then I can use
brandRegEx.findFirstIn("hello this mini is bmW testing")
But how can I use it on all the lines of the RDD, and apply different regular expressions, to achieve the output above?
I came across this code snippet, but I am not sure how to put it all together.
val brandRegEx = """^.*[Bb][Mm][Ww]+|.[Mm][Ii][Nn][Ii]+.*$""".r
def getBrand(Col4: String): String = Col4 match {
  case brandRegEx(str) =>
  case _ => ""
  return 'substring
}
Any help would be appreciated !
Thanks
To apply your regex to each item in the RDD, you should use the RDD map function, which transforms each row of the RDD using some function (in this case, a partial function that extracts the two parts of the tuple making up each row):
import org.apache.spark.{SparkConf, SparkContext}

object Example extends App {

  val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("Example"))

  val data = Seq(
    ("BMW 1er Model", 278),
    ("MINI Cooper Model", 248))

  val dataRDD = sc.parallelize(data)

  val processedRDD = dataRDD.map {
    case (inString, inInt) =>
      val brandRegEx = """^.*[Bb][Mm][Ww]+|.[Mm][Ii][Nn][Ii]+.*$""".r
      val brand = brandRegEx.findFirstIn(inString).getOrElse("NOT FOUND")
      // val seriesRegEx = ...
      // val series = seriesRegEx.findFirstIn(inString)
      val series = "foo"
      (inString, inInt, brand, series)
  }

  processedRDD.collect().foreach(println)
  sc.stop()
}
Note that I think you have some problems in your regular expression, and you also need a regular expression for finding the series. This code outputs:
(BMW 1er Model,278,BMW,foo)
(MINI Cooper Model,248,NOT FOUND,foo)
But if you correct your regexes for your needs, this is how you can apply them to each row.
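As a rough sketch of what the series handling could look like, a single regex with two capturing groups can pull out both Brand and Series at once, assuming (as in the two sample rows) the brand is always the first word and the series the second; the regex here is only an illustration, not part of the original answer:
val brandSeriesRegEx = """^\s*(\S+)\s+(\S+).*$""".r

val processedRDD2 = dataRDD.map {
  case (inString, inInt) =>
    inString match {
      // a whole-string match binds the two capturing groups to brand and series
      case brandSeriesRegEx(brand, series) => (inString, inInt, brand, series)
      case _                               => (inString, inInt, "", "")
    }
}

processedRDD2.collect().foreach(println)
// (BMW 1er Model,278,BMW,1er)
// (MINI Cooper Model,248,MINI,Cooper)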
Hi, I was looking at another question and came across this one. The above problem can also be done with ordinary transformations (here collection is the sequence of (String, Int) pairs):
val a = sc.parallelize(collection)
a.map { case (x, y) => (x.split(" ")(0) + " " + x.split(" ")(1)) }.collect