PySpark code with Spigot function not working in AWS Glue

I am new to AWS Glue. According to the AWS Glue documentation, the Spigot transform writes sample records from a DynamicFrame to an S3 directory. But when I run the job, no file is created under that S3 directory. What am I doing wrong? Below is the test code.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "amssurveydb", table_name = "amssurvey", transformation_ctx = "datasource0")
split1 = SplitRows.apply(datasource0, {"count": {">": 50}}, "split11", "split12", transformation_ctx ="split1")## #type: SplitRows
selFromCol1 = SelectFromCollection.apply(dfc = split1, key = "split11", transformation_ctx = "selFromCol1")
selFromCol2 = SelectFromCollection.apply(dfc = split1, key = "split12", transformation_ctx = "selFromCol2")
spigot1 = Spigot.apply(frame = selFromCol1, path = "s3://asgqatestautomation3/SourceFiles/spigot1Op", options = {"topk":5},transformation_ctx ="spigot1")
job.commit()

Related

How to get data from RDS postgresql to S3

I am trying to get all records from an RDS PostgreSQL database in AWS and load them into an S3 bucket as a CSV file.
The script I am using is:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node JDBC Connection
JDBCConnection_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="core",
    table_name="core_public_table",
    transformation_ctx="JDBCConnection_node1",
)
# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
    frame=JDBCConnection_node1,
    mappings=[
        ("sentiment", "string", "sentiment", "string"),
        ("scheduling", "long", "scheduling", "long"),
    ],
    transformation_ctx="ApplyMapping_node2",
)
# Script generated for node S3 bucket
S3bucket_node3 = glueContext.write_dynamic_frame.from_options(
    frame=ApplyMapping_node2,
    connection_type="s3",
    format="csv",
    connection_options={"path": "s3://path_to_s3", "partitionKeys": []},
    transformation_ctx="S3bucket_node3",
)
job.commit()
But the job fails with: An error occurred while calling getDynamicFrame
I have read every related post on Stack Overflow but I can't solve it. Can anyone please help me with this issue?

Can I apply the AWS FindMatches transform on a DataFrame? If yes, then how?

I want to find out whether I can apply the FindMatches ML transform in AWS Glue to a Spark DataFrame. Currently I can use it on a DynamicFrame. Below is the syntax for using the FindMatches transform on a DynamicFrame.
<output DynamicFrame on which the ML transform has been applied> = FindMatches.apply(frame = <input DynamicFrame>, transformId = <transform ID of the FindMatches ML transform created separately>)
I have tried using a DataFrame in place of the input DynamicFrame, and when I run the Glue job, it fails. The error is shown below:
"AttributeError: 'DataFrame' object has no attribute 'glue_ctx'"
Below is the code where I tried using a DataFrame:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglueml.transforms import FindMatches
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "hospitality", table_name = "personinputdata", transformation_ctx = "datasource0")
df0 = datasource0.toDF()
resolvechoice1 = ResolveChoice.apply(frame = datasource0, choice = "MATCH_CATALOG", database = "hospitality", table_name = "personinputdata", transformation_ctx = "resolvechoice1")
findmatchdf = FindMatches.apply(frame = df0, transformId = "tfm-01cc9b02c93640cfc7ce5ea91745e24258cb2e01")
findmatchdf.show()
And below is the code that uses a DynamicFrame instead of a DataFrame; this version works.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglueml.transforms import FindMatches
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "hospitality", table_name = "patientinputdata", transformation_ctx = "datasource0")
resolvechoice1 = ResolveChoice.apply(frame = datasource0, choice = "MATCH_CATALOG", database = "hospitality", table_name = "patientinputdata", transformation_ctx = "resolvechoice1")
findmatches2 = FindMatches.apply(frame = resolvechoice1, transformId = "tfm-0cadd1e6d2da40d7c18db7836e92be93833b6019", transformation_ctx = "findmatches2")
I searched online for the source code of the FindMatches ML transform but could not find it anywhere.

As you already know, FindMatches works only on DynamicFrames.
So convert your Spark DataFrame to a DynamicFrame whenever you want to run it:
from awsglue.dynamicframe import DynamicFrame
glueContext = GlueContext(SparkContext.getOrCreate())
Dyf0 = DynamicFrame.fromDF(df0, glueContext, "anyname")
Then run FindMatches on Dyf0 as required.
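The conversion described above could be wrapped in two small helpers, roughly as follows. This is only a sketch: the helper names `to_dynamic_frame` and `find_matches_on_dataframe` are made up for illustration, the Glue imports are done inside the functions so the helpers can at least be defined outside a Glue runtime, and it assumes (as the question's syntax note says) that `FindMatches.apply` returns a DynamicFrame.

```python
def to_dynamic_frame(df, glue_context, name="converted"):
    """Wrap a Spark DataFrame in a Glue DynamicFrame (hypothetical helper)."""
    # Imported lazily: awsglue is only available inside a Glue job or dev endpoint.
    from awsglue.dynamicframe import DynamicFrame

    return DynamicFrame.fromDF(df, glue_context, name)


def find_matches_on_dataframe(df, glue_context, transform_id):
    """Run FindMatches on a DataFrame by converting it first (hypothetical helper).

    Returns a Spark DataFrame so the caller can keep using DataFrame methods
    such as show().
    """
    # Same import as in the question; also only available inside Glue.
    from awsglueml.transforms import FindMatches

    dyf = to_dynamic_frame(df, glue_context, "findmatches_input")
    matched = FindMatches.apply(frame=dyf, transformId=transform_id)
    # Convert the resulting DynamicFrame back to a DataFrame.
    return matched.toDF()
```

Inside the job from the question, this would be called as `find_matches_on_dataframe(df0, glueContext, "<your transform id>")`, followed by `.show()` on the result.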

Create partitioned data using AWS Glue and save into S3

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import col,year,month,dayofmonth,to_date,from_unixtime
## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "db_name", table_name = "table_name", transformation_ctx = "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("dateregistered", "timestamp", "dateregistered", "timestamp"), ("id", "int", "id", "int")], transformation_ctx = "applymapping1")
df = applymapping1.toDF()
repartitioned_with_new_columns_df = (
    applymapping1.select("*")
    .withColumn("date_col", to_date(from_unixtime(col("dateRegistered"))))
    .withColumn("year", year(col("date_col")))
    .withColumn("month", month(col("date_col")))
    .withColumn("day", dayofmonth(col("date_col")))
    .drop(col("date_col"))
    # .repartition(1)
)
dyf = DynamicFrame.fromDF(repartitioned_with_new_columns_df, glueContext, "enriched")
datasink = glueContext.write_dynamic_frame.from_options(
    frame = dyf,
    connection_type = "s3",
    connection_options = {
        "path": "bucket-path",
        "partitionKeys": ["year", "month", "day"]
    },
    format = "json",
    transformation_ctx = "datasink")
job.commit()
I have the above script and I can't figure out why it is not working, or whether it is even the correct approach.
Could someone please review it and tell me what I am doing wrong?
The goal is to run this job daily, write this table partitioned as above, and save it in S3 as either JSON or Parquet.

You are referring to the wrong object when manipulating the columns: applymapping1 is a DynamicFrame, while df is the Spark DataFrame created from it with toDF().
applymapping1.select("*") should actually be df.select("*")
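A side note on what to expect in S3 once the job writes successfully: with partitionKeys set, Glue lays the output out in Hive-style key=value prefixes. The sketch below only mimics that layout in plain Python so you can predict where a given date should land. The function name and the bucket path are made up for illustration, and it assumes the year, month, and day columns are integers (which is what Spark's year, month, and dayofmonth functions return), so the values are not zero-padded.

```python
from datetime import datetime


def partition_prefix(base_path, ts, keys=("year", "month", "day")):
    """Return the Hive-style prefix a row with timestamp `ts` would fall under.

    Hypothetical helper for illustration only; it does not call Glue or S3.
    """
    values = {"year": ts.year, "month": ts.month, "day": ts.day}
    # One "key=value" path segment per partition key, in the order given.
    parts = [f"{key}={values[key]}" for key in keys]
    return "/".join([base_path.rstrip("/")] + parts) + "/"


# Example with a made-up bucket path.
print(partition_prefix("s3://example-bucket/output", datetime(2021, 3, 7)))
# prints s3://example-bucket/output/year=2021/month=3/day=7/
```

Listing that prefix after a run is a quick way to confirm the daily job wrote the partition you expected.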

AWS Glue NameError: name 'DynamicFrame' is not defined

I'm trying to convert a DataFrame to a DynamicFrame using the toDF and fromDF functions (https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-fromDF), as in the code snippet below:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## #type: DataSource
## #args: [database = "test-3", table_name = "test", transformation_ctx = "datasource0"]
## #return: datasource0
## #inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "test-3", table_name = "test", transformation_ctx = "datasource0")
foo = datasource0.toDF()
bar = DynamicFrame.fromDF(foo, glueContext, "bar")
However, I'm getting an error on the line:
bar = DynamicFrame.fromDF(foo, glueContext, "bar")
The error says:
NameError: name 'DynamicFrame' is not defined
I've tried the usual googling to no avail, and I can't see what I've done wrong by comparing with other examples. Does anyone know why I'm getting this error and how to resolve it?

Import DynamicFrame:
from awsglue.dynamicframe import DynamicFrame

You need to import the DynamicFrame class from the awsglue.dynamicframe module:
from awsglue.dynamicframe import DynamicFrame
A lot of things are missing from the examples provided with the AWS Glue ETL documentation.
However, you can refer to the following GitHub repository, which contains many examples of basic tasks with Glue ETL:
AWS Glue samples

AWS Glue hanging up & consuming lot of time in ETL job

I am using AWS Glue to copy records from an Oracle table (which has 80 million rows) to Redshift. However, after almost 2 hours the job is still hanging and nothing has been written to Amazon S3, so eventually I have to stop the job.
My code:
import sys
import boto3
import json
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
db_username = [removed]
db_password = [removed]
db_url = [removed]
table_name = [removed]
jdbc_driver_name = "oracle.jdbc.OracleDriver"
s3_output = [removed]
df = glueContext.read.format("jdbc").option("url", db_url).option("user", db_username).option("password", db_password).option("dbtable", table_name).option("driver", jdbc_driver_name).load()
df.printSchema()
datasource0 = DynamicFrame.fromDF(df, glueContext, "datasource0")
datasource0.schema()
datasource0.show()
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("correlation_id", "decimal", "correlation_id", "bigint"), ("machine_pin","varchar","machine_pin","varchar"),("messageguid","varchar","messageguid","varchar"), ("originating_domain_object_id", "decimal", "originating_domain_object_id", "bigint"), ("originating_message_type_id", "bigint", "originating_message_type_id", "bigint"), ("source_messageguid","varchar","source_messageguid","varchar"), ("timestamp_of_request","timestamp","timestamp_of_request","timestamp"),("token","varchar","token","varchar"),("id","decimal","id","bigint"),("file_attachment","decimal","file_attachment","bigint")], transformation_ctx = "applymapping1")
resolvechoice2 = ResolveChoice.apply(frame = applymapping1,choice = "make_cols", transformation_ctx = "resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields3, catalog_connection = "us01-isg-analytics", connection_options = {"dbtable": "analytics_team_data.message_details", "database": "jk_test"}, redshift_tmp_dir = "s3://aws-glue-scripts-823837687343-us-east-1/glue_op/", transformation_ctx = "datasink4")
When I use Apache Spark, it takes less than 1 hour to load the data into Redshift. What modifications are required so that Glue loads the data faster?
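One thing that may be worth checking, offered as a guess at the cause rather than a confirmed diagnosis: a Spark JDBC read that sets only url, dbtable, user, password, and driver pulls the whole table through a single connection into a single partition. Spark's JDBC source accepts partitionColumn, lowerBound, upperBound, and numPartitions to split the read into parallel range queries, and fetchsize to control how many rows are fetched per round trip. The sketch below just builds such an option set. The helper name, the connection values, the choice of id as the partition column, and the bounds are all hypothetical placeholders to adapt to the real table.

```python
def jdbc_parallel_read_options(url, table, user, password, driver,
                               partition_column, lower_bound, upper_bound,
                               num_partitions=16, fetch_size=10000):
    """Build Spark JDBC reader options for a range-partitioned read.

    Hypothetical helper: it only assembles the option dict, it does not
    connect to anything.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower_bound >= upper_bound:
        raise ValueError("lower_bound must be smaller than upper_bound")
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": driver,
        # The partition column must be numeric, date, or timestamp.
        "partitionColumn": partition_column,
        # The bounds only decide the partition stride; rows outside them are still read.
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        "numPartitions": str(num_partitions),
        "fetchsize": str(fetch_size),
    }


# Placeholder values for illustration only.
opts = jdbc_parallel_read_options(
    url="jdbc:oracle:thin:@//example-host:1521/example_service",
    table="EXAMPLE_SCHEMA.MESSAGE_DETAILS",
    user="example_user",
    password="example_password",
    driver="oracle.jdbc.OracleDriver",
    partition_column="id",   # assumed to be roughly evenly distributed
    lower_bound=1,
    upper_bound=80_000_000,  # assumed from the 80 million row count
)

# Inside the Glue job this would replace the chained .option(...) calls:
# df = glueContext.read.format("jdbc").options(**opts).load()
```

It may also help to drop the datasource0.show() call before the write, since it triggers an extra read of the source just to print a few rows.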
