I have a txt file which contains data like
3,3e,4,5
3,5s,4#,5
5,6,2,4
and so on
Now, using Spark, I have to remove these characters and then add all the values into an aggregated sum.
How do I remove the special characters and sum all the values?
I have created a dataframe and used regexp_replace to remove the special characters.
But with the .withColumn clause I can only clean the columns one by one rather than all at once, which I believe is not optimized code.
Secondly, I have to add all the values into an aggregated sum. How do I get the aggregate value?
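As a side note, the per-column cleanup the question describes can be done in one select instead of chaining withColumn per column; a minimal sketch, assuming the DataFrame mentioned is called df and that every column should end up numeric:
import org.apache.spark.sql.functions.{col, regexp_replace}
// strip everything that is not a digit from every column in one pass, then cast to int
val cleaned = df.select(df.columns.map(c => regexp_replace(col(c), "[^0-9]", "").cast("int").as(c)): _*)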
If you have a fixed number of columns in the input data then you can use the approach below.
//Input Text file
scala> val rdd = sc.textFile("/spath/stack.txt")
scala> rdd.collect()
res108: Array[String] = Array("3,3e,4,5 ", "3,5s,4#,5 ", 5,6,2,4)
//remove special characters
scala> val rdd1 = rdd.map{x => x.replaceAll("[^,0-9]", "")}
scala> rdd1.collect
res109: Array[String] = Array(3,3,4,5, 3,5,4,5, 5,6,2,4)
//Convert RDD into DataFrame
scala> val df = rdd1.map(_.split(",")).map(x => (x(0).toInt,x(1).toInt,x(2).toInt,x(3).toInt)).toDF
scala> df.show(false)
+---+---+---+---+
|_1 |_2 |_3 |_4 |
+---+---+---+---+
|3 |3 |4 |5 |
|3 |5 |4 |5 |
|5 |6 |2 |4 |
+---+---+---+---+
//UDF to sum up the values in each row
scala> val sumUDF = udf((r:Row) => {
| r.getAs("_1").toString.toInt + r.getAs("_2").toString.toInt + r.getAs("_3").toString.toInt + r.getAs("_4").toString.toInt
| })
//Expected DataFrame
scala> val finaldf = df.withColumn("sumcol", sumUDF(struct(df.columns map col: _*)))
scala> finaldf.show(false)
+---+---+---+---+------+
|_1 |_2 |_3 |_4 |sumcol|
+---+---+---+---+------+
|3 |3 |4 |5 |15 |
|3 |5 |4 |5 |17 |
|5 |6 |2 |4 |17 |
+---+---+---+---+------+
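The question also asks for a single aggregated sum over all values; one more aggregation over sumcol gives it (a small sketch building on finaldf above):
import org.apache.spark.sql.functions.sum
// collapse the per-row sums into one overall total (15 + 17 + 17 = 49 for the sample data)
val total = finaldf.agg(sum("sumcol")).first().getLong(0)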
I have a dataframe with the below structure
+------+-------------+--------+
|region| key| val|
+------+-------------+--------+
|Sample|row1 | 6|
|Sample|row1_category| Cat 1|
|Sample|row1_Unit | Kg|
|Sample|row2 | 4|
|Sample|row2_category| Cat 2|
|Sample|row2_Unit | ltr|
+------+-------------+--------+
I tried to add a column and push the values from rows to columns, but I could not get the category and unit columns populated correctly.
I want to convert it into the below structure
+------+-------------+--------+--------+--------+
|region| key| val|Category| Unit |
+------+-------------+--------+--------+--------+
|Sample|row1 | 6| Cat 1| Kg|
|Sample|row2 | 4| Cat 2| ltr|
+------+-------------+--------+--------+--------+
I need to do this for multiple keys; I will have row2, row3, etc.
scala> df.show
+------+-------------+----+
|region| key| val|
+------+-------------+----+
|Sample| row1| 6|
|Sample|row1_category|Cat1|
|Sample| row1_Unit| Kg|
|Sample| row2| 4|
|Sample|row2_category|Cat2|
|Sample| row2_Unit| ltr|
+------+-------------+----+
scala> val df1 = df.withColumn("_temp", split( $"key" , "_")).select(col("region"), $"_temp".getItem(0) as "key",$"_temp".getItem(1) as "colType",col("val"))
scala> df1.show(false)
+------+----+--------+----+
|region|key |colType |val |
+------+----+--------+----+
|Sample|row1|null |6 |
|Sample|row1|category|Cat1|
|Sample|row1|Unit |Kg |
|Sample|row2|null |4 |
|Sample|row2|category|Cat2|
|Sample|row2|Unit |ltr |
+------+----+--------+----+
scala> val df2 = df1.withColumn("Category", when(col("colType") === "category", col("val"))).withColumn("Unit", when(col("colType") === "Unit", col("val"))).withColumn("val", when(col("colType").isNull, col("val")))
scala> df2.show(false)
+------+----+--------+----+--------+----+
|region|key |colType |val |Category|Unit|
+------+----+--------+----+--------+----+
|Sample|row1|null |6 |null |null|
|Sample|row1|category|null|Cat1 |null|
|Sample|row1|Unit |null|null |Kg |
|Sample|row2|null |4 |null |null|
|Sample|row2|category|null|Cat2 |null|
|Sample|row2|Unit |null|null |ltr |
+------+----+--------+----+--------+----+
scala> val df3 = df2.groupBy("region", "key").agg(concat_ws("",collect_set(when($"val".isNotNull, $"val"))).as("val"),concat_ws("",collect_set(when($"Category".isNotNull, $"Category"))).as("Category"), concat_ws("",collect_set(when($"Unit".isNotNull, $"Unit"))).as("Unit"))
scala> df3.show()
+------+----+---+--------+----+
|region| key|val|Category|Unit|
+------+----+---+--------+----+
|Sample|row1| 6| Cat1| Kg|
|Sample|row2| 4| Cat2| ltr|
+------+----+---+--------+----+
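A slightly more compact variant of the same idea (just a sketch, reusing df1 from above) swaps the when/collect_set combination for first(..., ignoreNulls = true):
import org.apache.spark.sql.functions.{first, when}
// pick the single non-null value per group for each target column
val df3b = df1.groupBy("region", "key").agg(
  first(when($"colType".isNull, $"val"), ignoreNulls = true).as("val"),
  first(when($"colType" === "category", $"val"), ignoreNulls = true).as("Category"),
  first(when($"colType" === "Unit", $"val"), ignoreNulls = true).as("Unit"))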
You can achieve it by grouping by your key (and probably region as well) and aggregating with collect_list; using the regex ^[^_]+ you will get all characters up to the _ character.
UPDATE: You can use a (\\d{1,}) regex to find numbers in a string via capturing groups. For example, if you have row_123_456_unit and your expression looks like regexp_extract('val, "(\\d{1,})", 0) you will get 123; to also pick up 456 you need a pattern with two capture groups, e.g. (\\d+)_(\\d+), and 2 as the last parameter. Hope it helps.
df.printSchema()
df.show()
val regex1 = "^[^_]+" // until '_' character
val regex2 = "(\\d{1,})" // capture group of numbers
df.groupBy('region, regexp_extract('key, regex1, 0))
  .agg('region, collect_list('key).as("key"), collect_list('val).as("val"))
  .select('region,
    'key.getItem(0).as("key"),
    'val.getItem(0).as("val"),
    'val.getItem(1).as("Category"),
    'val.getItem(2).as("Unit")
  ).show()
output:
root
|-- region: string (nullable = true)
|-- key: string (nullable = true)
|-- val: string (nullable = true)
+------+-------------+-----+
|region| key| val|
+------+-------------+-----+
|Sample| row1| 6|
|Sample|row1_category|Cat 1|
|Sample| row1_Unit| Kg|
|Sample| row2| 4|
|Sample|row2_category|Cat 2|
|Sample| row2_Unit| ltr|
+------+-------------+-----+
+------+----+---+--------+----+
|region| key|val|Category|Unit|
+------+----+---+--------+----+
|Sample|row1| 6| Cat 1| Kg|
|Sample|row2| 4| Cat 2| ltr|
+------+----+---+--------+----+
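To make the UPDATE above concrete, a small sketch of pulling both groups of digits out of a hypothetical key column holding values like row_123_456_unit:
import org.apache.spark.sql.functions.regexp_extract
// group 1 of (\d+)_(\d+) is the first run of digits, group 2 the second
val sample = Seq("row_123_456_unit").toDF("key")
sample.select(
  regexp_extract($"key", "(\\d+)_(\\d+)", 1).as("first_num"),   // 123
  regexp_extract($"key", "(\\d+)_(\\d+)", 2).as("second_num")   // 456
).show()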
I have a dataframe where some strings have a " at the front and the end of the string.
Eg:
+---------------------+
|data                 |
+---------------------+
|"john belushi"       |
|"john mnunjnj"       |
|"nmnj tyhng"         |
|"John b-e_lushi"     |
|"john belushi's book"|
+---------------------+
Expected output:
+-------------------+
|data               |
+-------------------+
|john belushi       |
|john mnunjnj       |
|nmnj tyhng         |
|John b-e_lushi     |
|john belushi's book|
+-------------------+
I am trying to remove only the " double quotes from the strings. Can someone tell me how I can do this in Scala?
Python provides lstrip and rstrip. Is there anything equivalent to that in Scala?
Use the expr, substring and length functions and take the substring starting at position 2 with length length(data) - 2.
val df_d = List("\"john belushi\"", "\"John b-e_lushi\"", "\"john belushi's book\"")
.toDF("data")
Input:
+---------------------+
|data |
+---------------------+
|"john belushi" |
|"John b-e_lushi" |
|"john belushi's book"|
+---------------------+
Using expr, substring and length functions:
import org.apache.spark.sql.functions.expr
df_d.withColumn("data", expr("substring(data, 2, length(data) - 2)"))
.show(false)
Output:
+-------------------+
|data |
+-------------------+
|john belushi |
|John b-e_lushi |
|john belushi's book|
+-------------------+
How to remove quotes from the front and end of a string in Scala?
myString.substring(1, myString.length()-1) will remove the double quotes.
import spark.implicits._
val list = List("\"hi\"", "\"I am learning scala\"", "\"pls\"", "\"help\"").toDF()
list.show(false)
val finaldf = list.map {
row => {
val stringdoublequotestoberemoved = row.getAs[String]("value")
stringdoublequotestoberemoved.substring(1, stringdoublequotestoberemoved.length() - 1)
}
}
finaldf.show(false)
Result :
+--------------------+
| value|
+--------------------+
| "hi"|
|"I am learning sc...|
| "pls"|
| "help"|
+--------------------+
+-------------------+
| value|
+-------------------+
| hi|
|I am learning scala|
| pls|
| help|
+-------------------+
Try this:
scala> val dataFrame = List("\"john belushi\"","\"john mnunjnj\"" , "\"nmnj tyhng\"" ,"\"John b-e_lushi\"", "\"john belushi's book\"").toDF("data")
scala> dataFrame.map { row => row.mkString.stripPrefix("\"").stripSuffix("\"")}.show
+-------------------+
| value|
+-------------------+
| john belushi|
| john mnunjnj|
| nmnj tyhng|
| John b-e_lushi|
|john belushi's book|
+-------------------+
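As a side note on the lstrip/rstrip part of the question: an anchored regexp_replace does the equivalent job column-wise; a minimal sketch against the dataFrame defined above:
import org.apache.spark.sql.functions.regexp_replace
// drop one double quote at the start and one at the end, leaving inner characters alone
dataFrame.withColumn("data", regexp_replace($"data", "^\"|\"$", "")).show(false)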
I am trying to replace the substring '$NUMBER' with the value in the column 'number' for each row.
I tried
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

replace_udf = udf(
    lambda long_text, number: long_text.replace("$NUMBER", number),
    StringType()
)
df = df.withColumn('long_text',replace_udf(col('long_text'),col('number')))
and
from pyspark.sql.functions import expr
df = df.withColumn('long_text',expr("regexp_replace(long_text, '$NUMBER', number)"))
but nothing works. I can't figure out how another column can be the replacement for the substring.
SAMPLE:
df1 = spark.createDataFrame(
    [
        ("hahaha the $NUMBER is good",3),
        ("i dont know about $NUMBER",2),
        ("what is $NUMBER doing?",5),
        ("ajajaj $NUMBER",2),
        ("$NUMBER dwarfs",1)
    ],
    ["long_text","number"]
)
INPUT:
+--------------------------+------+
|                 long_text|number|
+--------------------------+------+
|hahaha the $NUMBER is good|     3|
|    what is $NUMBER doing?|     5|
|            ajajaj $NUMBER|     2|
+--------------------------+------+
EXPECTED OUTPUT:
+--------------------+------+
|           long_text|number|
+--------------------+------+
|hahaha the 3 is good|     3|
|    what is 5 doing?|     5|
|            ajajaj 2|     2|
+--------------------+------+
Similar question where the answers didn't cover the column replacement:
Spark column string replace when present in other column (row)
The problem is that $ has a special meaning in regular expressions, which means match the end of the line. So your code:
regexp_replace(long_text, '$NUMBER', number)
Is trying to match the pattern: end of line followed by the literal string NUMBER (which can never match anything).
In order to match a $ (or any other regex special character), you have to escape it with a \.
from pyspark.sql.functions import expr
df = df.withColumn('long_text',expr("regexp_replace(long_text, '\$NUMBER', number)"))
df.show()
#+--------------------+------+
#| long_text|number|
#+--------------------+------+
#|hahaha the 3 is good| 3|
#| what is 5 doing?| 5|
#| ajajaj 2| 2|
#+--------------------+------+
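If escaping regex metacharacters feels fragile, there is also a plain (non-regex) string replace in Spark SQL (the replace function, available in newer Spark versions, 2.3+ if I remember right); a sketch:
from pyspark.sql.functions import expr
# replace() does a literal string replacement, so '$' needs no escaping;
# number is cast to string explicitly since the column is an integer
df = df.withColumn('long_text', expr("replace(long_text, '$NUMBER', cast(number as string))"))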
You have to cast the number column to string with str() before you can use it with replace in your lambda:
from pyspark.sql import types as T
from pyspark.sql import functions as F
l = [( 'hahaha the $NUMBER is good', 3)
,('what is $NUMBER doing?' , 5)
,('ajajaj $NUMBER ' , 2)]
df = spark.createDataFrame(l,['long_text','number'])
#Just added str() to your function
replace_udf = F.udf(lambda long_text, number: long_text.replace("$NUMBER", str(number)), T.StringType())
df.withColumn('long_text',replace_udf(F.col('long_text'),F.col('number'))).show()
+--------------------+------+
| long_text|number|
+--------------------+------+
|hahaha the 3 is good| 3|
| what is 5 doing?| 5|
| ajajaj 2 | 2|
+--------------------+------+
I have a csv file with one column named id and another named genre, which can contain any number of genres.
1,Action|Horror|Adventure
2,Action|Adventure
Is it possible to do something like: select a row, and for each genre insert the current id and genre into another dataframe?
1,Action
1,Horror
1,Adventure
2,Action
2,Adventure
You can use a udf to split the genre data and then use the explode function.
from pyspark.sql.functions import explode, udf
from pyspark.sql.types import ArrayType, StringType
s = [('1','Action|Adventure'),('2','Comdey|Action')]
rdd = sc.parallelize(s)
df = sqlContext.createDataFrame(rdd,['id','Col'])
df.show()
+---+----------------+
| id| Col|
+---+----------------+
| 1|Action|Adventure|
| 2| Comdey|Action|
+---+----------------+
newcol = udf(lambda x : x.split('|'), ArrayType(StringType()))
df1 = df.withColumn('Genre', explode(newcol('Col'))).drop('Col')
df1.show()
+---+---------+
| id| Genre|
+---+---------+
| 1| Action|
| 1|Adventure|
| 2| Comdey|
| 2| Action|
+---+---------+
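For what it's worth, the same result can also be had without a UDF by using the built-in split function (a sketch against the df created above; split takes a regex, so the | has to be escaped):
from pyspark.sql.functions import col, explode, split
# split the Col string on '|' and explode the resulting array into one row per genre
df2 = df.withColumn('Genre', explode(split(col('Col'), '\\|'))).drop('Col')
df2.show()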
In addition to Suresh's solution, you can also use flatMap after splitting your string to achieve the same:
#Read csv from file (works in Spark 2.x and onwards)
df_csv = sqlContext.read.csv("genre.csv")
#Split the Genre (second field) on the character |, but leave the id (first field) as is
rdd_split = df_csv.rdd.map(lambda row: (row[0], row[1].split('|')))
#Use a list comprehension to pair the id with each Genre
rdd_explode = rdd_split.flatMap(lambda pair: [(pair[0], genre) for genre in pair[1]])
#Convert the resulting RDD back to a dataframe
df_final = rdd_explode.toDF(['id','Genre'])
df_final.show() returns this as output:
+---+---------+
| id| Genre|
+---+---------+
| 1| Action|
| 1| Horror|
| 1|Adventure|
| 2| Action|
| 2|Adventure|
+---+---------+
I am trying some simple code to collapse my categorical variables in a dataframe to binary classes after indexing.
Currently my column has 3 classes: "A", "B", "C".
I am writing a simple if-else statement to collapse the classes, like:
def condition(r):
    if (r.wo_flag=="SLM" or r.wo_flag=="NON-SLM"):
        r.wo_flag="dispatch"
    else:
        r.wo_flag="non_dispatch"
    return r.wo_flag

df_final=df_new.map(lambda x: condition(x))
It's not working; it doesn't seem to handle the else condition.
+-----+--------+------+-------------------+----------------+-------+-------------+----+
|MData|Recode12|Status|DayOfWeekOfDispatch|MannerOfDispatch|Wo_flag|PlaceOfInjury|Race|
+-----+--------+------+-------------------+----------------+-------+-------------+----+
|    M|      11|     M|                  4|               7|      C|           99|   1|
|    M|       8|     D|                  3|               7|      A|           99|   1|
|    F|      10|     W|                  2|               7|      C|           99|   1|
|    M|       9|     D|                  1|               7|      B|           99|   1|
|    M|       8|     D|                  2|               7|      C|           99|   1|
+-----+--------+------+-------------------+----------------+-------+-------------+----+
This is the Sample Data
The accepted answer is not very efficient due to the use of a user defined function (UDF).
I think most people are looking for when.
from pyspark.sql.functions import when
matches = df["wo_flag"].isin("SLM", "NON-SLM")
new_df = df.withColumn("wo_flag", when(matches, "dispatch").otherwise("non-dispatch"))
Try this:
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
def modify_values(r):
    if r == "A" or r == "B":
        return "dispatch"
    else:
        return "non-dispatch"
ol_val = udf(modify_values, StringType())
new_df = df.withColumn("wo_flag",ol_val(df.wo_flag))
Things you are doing wrong:
You are trying to modify Rows (Rows are immutable).
When a map operation is done on a dataframe, the resulting data structure is a PipelinedRDD and not a dataframe. You have to apply .toDF() to get a dataframe back (see the sketch below).
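A minimal sketch of that last point, reusing df_new and the condition from the question (all other columns are dropped here just to keep it short):
# map() over the underlying RDD returns an RDD of tuples, not a DataFrame,
# so .toDF() is needed to get back to a DataFrame
mapped = df_new.rdd.map(
    lambda row: ("dispatch" if row.wo_flag in ("SLM", "NON-SLM") else "non_dispatch",))
df_final = mapped.toDF(["wo_flag"])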