How to apply regexp_replace with Unicode characters in Spark Hive - regex

I'm trying to count the number of emoji occurrences in a string column of a Spark DataFrame.
I use SQLTransformer.
My statement:
select LENGTH(regexp_replace(text, '[^\\uD83C-\\uDBFF\\uDC00-\\uDFFF]+', '')) as count_emoji from __THIS__
But this statement doesn't work.
What am I doing wrong?

It looks like your SQLTransformer statement is actually working. Please find the code below.
import org.apache.spark.sql.SparkSession

object SparkHiveExample extends App {

  val spark = SparkSession
    .builder()
    .master("local")
    .appName("Spark Hive Example")
    .getOrCreate()

  import spark.implicits._

  // Prepare test data
  val df = Seq("hello, how are you?\uD83D\uDE0A\uD83D\uDE0A\uD83D\uDE0A")
    .toDF("text")
  df.show(false)

  // +-------------------------+
  // |text                     |
  // +-------------------------+
  // |hello, how are you?😊😊😊|
  // +-------------------------+

  df.createOrReplaceTempView("__THIS__")

  val finalDf = spark.sql(
    "select LENGTH(regexp_replace(text, '[^\\\\uD83C-\\\\uDBFF\\\\uDC00-\\\\uDFFF]+', '')) as count_emoji from __THIS__")
  finalDf.show(false)

  // +-----------+
  // |count_emoji|
  // +-----------+
  // |3          |
  // +-----------+
}
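Since the question mentions SQLTransformer, here is a minimal sketch of plugging the verified statement into an actual SQLTransformer stage. It is shown in PySpark for brevity (the Scala org.apache.spark.ml.feature.SQLTransformer API is analogous) and assumes an existing PySpark SparkSession named spark:

from pyspark.ml.feature import SQLTransformer

# Hypothetical PySpark equivalent of the test data above
df = spark.createDataFrame([("hello, how are you?\U0001F60A\U0001F60A\U0001F60A",)], ["text"])

# Raw string so the SQL parser receives \\uD83C etc., exactly as in the spark.sql call above
statement = r"SELECT LENGTH(regexp_replace(text, '[^\\uD83C-\\uDBFF\\uDC00-\\uDFFF]+', '')) AS count_emoji FROM __THIS__"

sql_trans = SQLTransformer(statement=statement)
sql_trans.transform(df).show()
# Should report count_emoji = 3, matching the Scala example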
If you want to read data from a Hive table, instantiate the SparkSession with Hive support enabled. Configuration of Hive is done by placing your hive-site.xml, core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) files in conf/.
import java.io.File

// warehouseLocation points to the default location for managed databases and tables
val warehouseLocation = new File("spark-warehouse").getAbsolutePath

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()

Related

Extract a string from another column using regexp_extract

I want to extract the value of s[0] from "column1":
sada/object=fan/sn=dadfs/s[0]=gsf,sdfs,sfdgs,/s[1]=dfsd,sdg,hte,/redirect=sdgfd/
The expected output is the value of s[0]:
gsf,sdfs,sfdgs
I was trying to do this using \\ to escape the brackets, but it's not working:
REGEXP_EXTRACT(column1, 's\\[0\\] = ([^&]+)')
This is in PySpark.
Input:
from pyspark.sql import functions as F
# Spark dataframe:
df = spark.createDataFrame([("sada/object=fan/sn=dadfs/s[0]=gsf,sdfs,sfdgs,/s[1]=dfsd,sdg,hte,/redirect=sdgfd/",)], ["column1"])
# SQL table:
df.createOrReplaceTempView("df")
Your pattern fails because it has spaces around = that are not in the data, and [^&]+ runs to the end of the string since there is no &. Match non-greedily up to ,/ instead.
PySpark:
df.select(F.regexp_extract('column1', r's\[0\]=(.*?),/', 1).alias('match')).show()
# +--------------+
# |         match|
# +--------------+
# |gsf,sdfs,sfdgs|
# +--------------+
SQL:
spark.sql("select regexp_extract(column1, r's\\[0\\]=(.*?),/', 1) as match from df").show()
# +--------------+
# |         match|
# +--------------+
# |gsf,sdfs,sfdgs|
# +--------------+
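The same non-greedy pattern works for the other segments too; for example, a small sketch (reusing the df defined above) that extracts both s[0] and s[1]:

df.select(
    F.regexp_extract('column1', r's\[0\]=(.*?),/', 1).alias('s0'),
    F.regexp_extract('column1', r's\[1\]=(.*?),/', 1).alias('s1'),
).show(truncate=False)
# Should give s0 = gsf,sdfs,sfdgs and s1 = dfsd,sdg,hte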

Change values within AWS Glue DynamicFrame columns

I am trying to change values within some columns of my DynamicFrame in an AWS Glue job.
I see there is a Map function that seems useful for the task, but I cannot make it work.
This is my code:
def map_values_in_columns(self, df):
    df = Map.apply(frame=df, f=self._map_values_in_columns)
    return df

def _map_values_in_columns(self, rec):
    for k, v in self.config['value_mapping'].items():
        column_name = self.config['value_mapping'][k]['column_name']
        values = self.config['value_mapping'][k]['values']
        for old_value, new_value in values.items():
            if rec[column_name] == old_value:
                rec[column_name] = new_value
    return rec
My config file is a yaml file with this structure:
value_mapping:
  column_1:
    column_name: asd
    values:
      - old_value_1: new_value_1
      - old_value_2: new_value_2
  column_2:
    column_name: dsa
      - old_value_1: new_value_1
      - old_value_2: new_value_2
The above method throws a serialisation error:
_pickle.PicklingError: Could not serialize object: Py4JError: An error occurred while calling o81.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
I am not sure if this is due to how I am implementing the Map method, or if I should use a completely different approach.
So the question:
How can I change multiple values within multiple columns of an AWS Glue DynamicFrame, while avoiding conversion back and forth between DynamicFrames and DataFrames?
There are a few problems in your code and YAML config; I'm not going to debug them all here. (The pickling error most likely comes from passing a bound method to Map.apply, which forces Spark to serialize self together with the non-picklable objects it references.) See a working sample below, which can also be executed locally in a Jupyter notebook.
I have simplified the yaml to keep the parsing complexity low.
from awsglue.context import GlueContext
from awsglue.transforms import *
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
import yaml

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session  # the SparkSession backing the GlueContext

# Prepare test data
columns = ["id", "asd", "dsa"]
data = [("1", "retain", "old_val_dsa_1"), ("2", "old_val_asd_1", "old_val_dsa_2"), ("3", "old_val_asd_2", "retain"), ("4", None, "")]
df = spark.createDataFrame(data).toDF(*columns)
dyF = DynamicFrame.fromDF(df, glueContext, "test_dyF")

# Simplified mapping config: {column name: {old value: new value}}
config = yaml.safe_load('''
value_mapping:
  asd:
    old_val_asd_1: new_val_asd_1
    old_val_asd_2: new_val_asd_2
  dsa:
    old_val_dsa_1: new_val_dsa_1
    old_val_dsa_2: new_val_dsa_2
''')

def map_values(rec):
    for k, v in config['value_mapping'].items():
        if rec[k] is not None:
            replacement_val = v.get(rec[k])
            if replacement_val is not None:
                rec[k] = replacement_val
    return rec

print("-- dyF --")
dyF.toDF().show()

mapped_dyF = Map.apply(frame=dyF, f=map_values)
print("-- mapped_dyF --")
mapped_dyF.toDF().show()
-- dyF --
+---+-------------+-------------+
| id|          asd|          dsa|
+---+-------------+-------------+
|  1|       retain|old_val_dsa_1|
|  2|old_val_asd_1|old_val_dsa_2|
|  3|old_val_asd_2|       retain|
|  4|         null|             |
+---+-------------+-------------+

-- mapped_dyF --
+-------------+-------------+---+
|          asd|          dsa| id|
+-------------+-------------+---+
|       retain|new_val_dsa_1|  1|
|new_val_asd_1|new_val_dsa_2|  2|
|new_val_asd_2|       retain|  3|
|         null|             |  4|
+-------------+-------------+---+
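If you prefer to keep the nested layout from your original YAML (a column_name key plus a list of value pairs), here is a small sketch that flattens it into the simplified {column: {old: new}} shape used by map_values above; the key names follow your config, so adjust them as needed:

def flatten_value_mapping(cfg):
    """Turn {'column_1': {'column_name': 'asd', 'values': [{'old': 'new'}, ...]}, ...}
    into {'asd': {'old': 'new', ...}} so map_values() can look columns up directly."""
    flat = {}
    for entry in cfg['value_mapping'].values():
        merged = {}
        for pair in entry.get('values', []):
            merged.update(pair)  # each list item is a single-key dict
        flat[entry['column_name']] = merged
    return flat

# Hypothetical usage:
# raw_cfg = yaml.safe_load(open('config.yaml'))
# config = {'value_mapping': flatten_value_mapping(raw_cfg)}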

How to create a typed empty MapType?

I have a DataFrame whose schema I want to match, and it has a column of type MapType(StringType(), StringType()). I tried the following implementations (using Spark 2.2.1):
import pyspark.sql.functions as fx
from pyspark.sql.types import *
df = spark.createDataFrame([[1]], ['id'])
df = df.withColumn("map", fx.udf(dict, MapType(StringType(), StringType()))())
df = df.withColumn("map2", fx.create_map().cast(MapType(StringType(), StringType())))
The second attempt without the udf gives me this casting error:
cannot resolve 'map()' due to data type mismatch: cannot cast MapType(NullType,NullType,false) to MapType(StringType,StringType,true)
Is there a correct way to write the second implementation (without the UDF)?
I'm not sure if this is the "correct way" but here is a way to do this without the udf:
Create a new dataframe by specifying a schema, and do a crossJoin():
df = spark.createDataFrame([[1]], ['id'])
data = [({},)]
schema = StructType([StructField("map2", MapType(StringType(), StringType()))])
df2 = spark.createDataFrame(data, schema)
df.crossJoin(df2).show()
+---+-----+
| id| map2|
+---+-----+
|  1|Map()|
+---+-----+
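Another UDF-free option is to parse an empty JSON object with from_json, which lets you state the key and value types directly. A minimal sketch, assuming a Spark version whose from_json accepts a MapType schema:

import pyspark.sql.functions as fx
from pyspark.sql.types import MapType, StringType

df = spark.createDataFrame([[1]], ['id'])
# Parsing '{}' against an explicit MapType schema yields an empty map<string,string>
df = df.withColumn("map2", fx.from_json(fx.lit("{}"), MapType(StringType(), StringType())))
df.printSchema()  # map2 should be typed as map<string,string>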

Spark 2.3.0 Read Text File With Header Option Not Working

The code below works and creates a Spark DataFrame from a text file. However, I'm trying to use the header option to use the first line as the header, and for some reason it doesn't seem to be happening. I cannot understand why! It must be something stupid, but I cannot solve this.
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.master("local").appName("Word Count")\
...     .config("spark.some.config.option", "some-value")\
...     .getOrCreate()
>>> df = spark.read.option("header", "true")\
...     .option("delimiter", ",")\
...     .option("inferSchema", "true")\
...     .text("StockData/ETFs/aadr.us.txt")
>>> df.take(3)
Returns the following:
[Row(value=u'Date,Open,High,Low,Close,Volume,OpenInt'),
Row(value=u'2010-07-21,24.333,24.333,23.946,23.946,43321,0'),
Row(value=u'2010-07-22,24.644,24.644,24.362,24.487,18031,0')]
>>>df.columns
Returns the following:
['value']
Issue
The issue is that you are using the .text API call instead of .csv or .load. If you read the .text API documentation, it says:
def text(self, paths):
    """Loads text files and returns a :class:`DataFrame` whose schema starts with a
    string column named "value", and followed by partitioned columns if there are any.

    Each line in the text file is a new row in the resulting DataFrame.

    :param paths: string, or list of strings, for input path(s).

    >>> df = spark.read.text('python/test_support/sql/text-test.txt')
    >>> df.collect()
    [Row(value=u'hello'), Row(value=u'this')]
    """
Solution using .csv
Change the .text function call to .csv and you should be fine:
df = spark.read.option("header", "true") \
        .option("delimiter", ",") \
        .option("inferSchema", "true") \
        .csv("StockData/ETFs/aadr.us.txt")

df.show(2, truncate=False)
which should give you
+-------------------+------+------+------+------+------+-------+
|Date               |Open  |High  |Low   |Close |Volume|OpenInt|
+-------------------+------+------+------+------+------+-------+
|2010-07-21 00:00:00|24.333|24.333|23.946|23.946|43321 |0      |
|2010-07-22 00:00:00|24.644|24.644|24.362|24.487|18031 |0      |
+-------------------+------+------+------+------+------+-------+
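Once the file is read with .csv, you can confirm that the header and inferSchema options took effect (a small sketch; the exact types depend on your data, e.g. Date is inferred as a timestamp here):

df.printSchema()
print(df.columns)
# ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'OpenInt']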
Solution using .load
.load assumes the file is in Parquet format if no format option is defined, so you need to define the format as well:
df = spark.read\
    .format("com.databricks.spark.csv")\
    .option("header", "true") \
    .option("delimiter", ",") \
    .option("inferSchema", "true") \
    .load("StockData/ETFs/aadr.us.txt")

df.show(2, truncate=False)
I hope the answer is helpful.
Try the following:
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName('CaseStudy').getOrCreate()
df = spark.read.format("csv").option("header", "true").option("delimiter", "|").option("inferSchema", "true").load("file name")
df.show()

pyspark collect_set or collect_list with groupby

How can I use collect_set or collect_list on a DataFrame after a groupBy? For example: df.groupby('key').collect_set('values'). I get an error: AttributeError: 'GroupedData' object has no attribute 'collect_set'
You need to use agg. Example:
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F
sc = SparkContext("local")
sqlContext = HiveContext(sc)
df = sqlContext.createDataFrame([
    ("a", None, None),
    ("a", "code1", None),
    ("a", "code2", "name2"),
], ["id", "code", "name"])
df.show()
+---+-----+-----+
| id| code| name|
+---+-----+-----+
|  a| null| null|
|  a|code1| null|
|  a|code2|name2|
+---+-----+-----+
Note in the above you have to create a HiveContext. See https://stackoverflow.com/a/35529093/690430 for dealing with different Spark versions.
(df
 .groupby("id")
 .agg(F.collect_set("code"),
      F.collect_list("name"))
 .show())
+---+-----------------+------------------+
| id|collect_set(code)|collect_list(name)|
+---+-----------------+------------------+
|  a|   [code1, code2]|           [name2]|
+---+-----------------+------------------+
If your dataframe is large, you can try using a grouped aggregate pandas UDF (GROUPED_AGG) to avoid memory errors. It is also much faster.
Grouped aggregate pandas UDFs are similar to Spark aggregate functions. They are used with groupBy().agg() and pyspark.sql.Window, and they define an aggregation from one or more pandas.Series to a scalar value, where each pandas.Series represents a column within the group or window (see the Spark pandas UDF documentation).
Example:
import pyspark.sql.functions as F

@F.pandas_udf('string', F.PandasUDFType.GROUPED_AGG)
def collect_list(name):
    # name is a pandas Series holding the group's values; drop nulls before joining
    return ', '.join(name.dropna())

grouped_df = df.groupby('id').agg(collect_list(df["name"]).alias('names'))
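On Spark 3.x the same grouped aggregate can be written with Python type hints instead of PandasUDFType. A minimal sketch, assuming Spark 3.0+ with pyarrow installed and the df from the first answer:

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf('string')
def join_names(name: pd.Series) -> str:
    # Series -> scalar type hints mark this as a grouped aggregate UDF
    return ', '.join(name.dropna())

df.groupby('id').agg(join_names(df['name']).alias('names')).show()
# For the sample df above this should give a single row: id = a, names = name2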