As I understand it, Job Bookmarks prevent duplicated data. "Enable" updates the data based on the previous run, and "disable" processes the entire dataset (does this mean it overwrites it? I tried this, but the job took too long, and I'm not sure it does what I think it does).
But what if I want to overwrite the DynamoDB table in the job? I've seen examples where the output data is in S3, but I'm not sure about DynamoDB.
For example, I have a Glue job like this:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
# "TempDir" must be resolved here because args["TempDir"] is used below
args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node Redshift Cluster
RedshiftCluster_node1 = glueContext.create_dynamic_frame.from_catalog(
database="tr_bbd",
redshift_tmp_dir=args["TempDir"],
table_name="tr_bbd_vendor_info",
transformation_ctx="RedshiftCluster_node1",
)
# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
frame=RedshiftCluster_node1,
mappings=[
("vendor_code", "string", "vendor_code", "string"),
("vendor_group_id", "int", "vendor_group_id", "int"),
("vendor_group_status_name", "string", "vendor_group_status_name", "string")
],
transformation_ctx="ApplyMapping_node2",
)
# Script generated for node DynamoDB bucket
Datasink1 = glueContext.write_dynamic_frame_from_options(
frame=ApplyMapping_node2,
connection_type="dynamodb",
connection_options={
"dynamodb.output.tableName": "VENDOR_TABLE",
"dynamodb.throughput.write.percent": "1.0"
}
)
job.commit()
Thank you.
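One possible approach, offered as a sketch and not a confirmed answer: as far as I know, the Glue DynamoDB sink writes items with put semantics, so an item whose primary key already exists is replaced, while items that are absent from the new data stay in the table. Job bookmarks only decide which source data is read; they do not empty the target. If a full overwrite is needed, the table could be emptied before the Glue write. The helper name clear_table and the key attribute vendor_code are assumptions (the real key schema of VENDOR_TABLE is not shown); the function expects an object that behaves like a boto3 Table resource.

```python
def clear_table(table, key_names):
    """Delete every item from a DynamoDB table.

    `table` is expected to behave like a boto3 `Table` resource.
    `key_names` lists the primary-key attribute names (hypothetical here,
    because the key schema is not part of the question).
    Returns the number of delete requests issued.
    """
    deleted = 0
    scan_kwargs = {}
    # batch_writer buffers the deletes and sends them in batches
    with table.batch_writer() as batch:
        while True:
            page = table.scan(**scan_kwargs)
            for item in page.get("Items", []):
                batch.delete_item(Key={name: item[name] for name in key_names})
                deleted += 1
            last_key = page.get("LastEvaluatedKey")
            if not last_key:
                break  # no more pages to scan
            scan_kwargs["ExclusiveStartKey"] = last_key
    return deleted
```

It could be called before the write step, for example as clear_table(boto3.resource("dynamodb").Table("VENDOR_TABLE"), ["vendor_code"]). Scanning and deleting consumes read and write capacity, so for a large table it may be cheaper to delete and recreate the table instead.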
Related
I have a table with 13 records and a simple Glue job that reads from the table and writes to an S3 bucket in Parquet format, as shown in the code below. When we run the job, the number of Parquet files produced in S3 equals the number of records, so each row is written to its own Parquet file. I don't understand why this happens instead of all the records being stored in a single Parquet file. Is there a configuration setting or parameter that we have missed?
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node AWS Glue Data Catalog
AWSGlueDataCatalog_node1671552410216 = glueContext.create_dynamic_frame.from_catalog(
database="sql_database",
table_name="ob_cpanel_products",
transformation_ctx="AWSGlueDataCatalog_node1671552410216",
)
# Script generated for node Change Schema (Apply Mapping)
ChangeSchemaApplyMapping_node1671554868239 = ApplyMapping.apply(
frame=AWSGlueDataCatalog_node1671552410216,
mappings=[
("productid", "int", "productid", "long"),
("productcode", "string", "productcode", "string"),
("name", "string", "name", "string"),
("quantity", "int", "quantity", "long"),
("price", "decimal", "price", "decimal"),
],
transformation_ctx="ChangeSchemaApplyMapping_node1671554868239",
)
# Script generated for node Amazon S3
AmazonS3_node1671554880876 = glueContext.write_dynamic_frame.from_options(
frame=ChangeSchemaApplyMapping_node1671554868239,
connection_type="s3",
format="glueparquet",
connection_options={
"path": "s3://onebox-glue-etl-pre/etl-tpv/output/with_partition/",
"partitionKeys": [],
},
format_options={"compression": "gzip"},
transformation_ctx="AmazonS3_node1671554880876",
)
job.commit()
You should repartition the DynamicFrame before writing it, and then pass the repartitioned frame as the frame argument of the write call:
repartition_dataframe = ChangeSchemaApplyMapping_node1671554868239.repartition(1)  # 1 = desired number of output files
# You can also use coalesce instead of repartition
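The suggestion above can be sketched as a small helper. This is only an illustration: the name write_as_n_files and its parameters are hypothetical, and the write options are copied from the job in the question. Each non-empty Spark partition typically ends up as one output file, so reducing the partition count reduces the file count.

```python
def write_as_n_files(glue_context, frame, path, num_files=1):
    """Reduce `frame` to `num_files` partitions, then write it as Parquet.

    `glue_context` and `frame` are expected to be a GlueContext and a
    DynamicFrame; the function name and parameters are illustrative.
    """
    # coalesce merges partitions without a full shuffle
    compact = frame.coalesce(num_files)
    return glue_context.write_dynamic_frame.from_options(
        frame=compact,  # write the compacted frame, not the original one
        connection_type="s3",
        format="glueparquet",
        connection_options={"path": path, "partitionKeys": []},
        format_options={"compression": "gzip"},
    )
```

In the job above, the call would take ChangeSchemaApplyMapping_node1671554868239 as the frame and the existing S3 output path. For 13 rows a single partition is fine; for large datasets one partition concentrates all the write work on one executor.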
I am new to AWS. I have four JSON files in an S3 bucket, and I just need to copy them to another S3 bucket.
These are the JSON files in the S3 bucket:
02-12.json
03-12.Json
04-12.Json
05-12.Json
When copying them into another bucket, I get the results below.
I am using the code below:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node Amazon S3
AmazonS3_node1664415190345 = glueContext.create_dynamic_frame.from_options(
format_options={"multiline": False},
connection_type="s3",
format="json",
connection_options={
"paths": ["s3://xxx-xxxx-yy/test/"],
"recurse": True,
},
transformation_ctx="AmazonS3_node1664415190345",
)
# Script generated for node Amazon S3
AmazonS3_node1664415242024 = glueContext.write_dynamic_frame.from_options(
frame=AmazonS3_node1664415190345,
connection_type="s3",
format="json",
connection_options={
"path": "s3://xxx-yyyy-www/yyyy/ff/",
"partitionKeys": [],
},
transformation_ctx="AmazonS3_node1664415242024",
)
job.commit()
Can anyone advise what is wrong here?
If you can run the "aws" command, "aws s3 cp" is the easiest way to copy objects between buckets.
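If the copy has to happen from Python instead of the CLI, a plain object copy is another option. A Glue job parses the records and writes them back as Spark part files, so it generally does not preserve the original file names. The sketch below assumes a boto3 S3 client is passed in; the function name copy_prefix and its parameters are hypothetical.

```python
def copy_prefix(s3_client, src_bucket, src_prefix, dst_bucket, dst_prefix):
    """Copy every object under src_prefix to dst_prefix, keeping file names.

    `s3_client` is expected to behave like a boto3 S3 client.
    Returns the list of destination keys that were written.
    """
    copied = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=src_bucket, Prefix=src_prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                continue  # skip "folder" placeholder objects
            # keep the part of the key that follows the source prefix
            dst_key = dst_prefix + key[len(src_prefix):]
            s3_client.copy_object(
                Bucket=dst_bucket,
                Key=dst_key,
                CopySource={"Bucket": src_bucket, "Key": key},
            )
            copied.append(dst_key)
    return copied
```

With the buckets from the question this would be called as copy_prefix(boto3.client("s3"), "xxx-xxxx-yy", "test/", "xxx-yyyy-www", "yyyy/ff/"). A single copy_object call is limited to objects of up to 5 GB, which is not a concern for small JSON files.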
I have used the new AWS Glue Studio visual tool to try running a very simple SQL query, with a Catalog table as the source, a simple SparkSQL transform, and CSV file(s) in an S3 bucket as the target.
Each time I run the code, it succeeds but nothing is stored in the bucket, not even an empty CSV file.
Not sure if this is a SparkSQL problem, or an AWS Glue problem.
Here is the automatically generated code:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue import DynamicFrame
def sparkSqlQuery(glueContext, query, mapping, transformation_ctx) -> DynamicFrame:
    for alias, frame in mapping.items():
        frame.toDF().createOrReplaceTempView(alias)
    result = spark.sql(query)
    return DynamicFrame.fromDF(result, glueContext, transformation_ctx)
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node Data_Catalog_0
Data_Catalog_0_node1 = glueContext.create_dynamic_frame.from_catalog(
database="some_long_name_data_base_catalog",
table_name="catalog_table",
transformation_ctx="Data_Catalog_0_node1",
)
# Script generated for node ApplyMapping
SqlQuery0 = """
SELECT DISTINCT "ID"
FROM myDataSource
"""
ApplyMapping_node2 = sparkSqlQuery(
glueContext,
query=SqlQuery0,
mapping={"myDataSource": Data_Catalog_0_node1},
transformation_ctx="ApplyMapping_node2",
)
# Script generated for node Amazon S3
AmazonS3_node166237 = glueContext.write_dynamic_frame.from_options(
frame=ApplyMapping_node2,
connection_type="s3",
format="csv",
connection_options={
"path": "s3://target_bucket/results/",
"partitionKeys": [],
},
transformation_ctx="AmazonS3_node166237",
)
job.commit()
This is very similar to this question. I am more or less reposting it because I cannot comment on it with my low reputation, and although it is four months old, it is still unanswered.
The problem seems to be the double-quotes of the selected fields in the SQL query. Dropping them solved the issue.
In other words, I "wrongly" used this query syntax:
SELECT DISTINCT "ID"
FROM myDataSource
instead of this "correct" one:
SELECT DISTINCT ID
FROM myDataSource
There is no mention of this in the Spark SQL Syntax documentation.
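A likely explanation, stated as an assumption since the question does not confirm it: with default settings, Spark SQL parses a double-quoted token as a string literal and not as a column name, so SELECT DISTINCT "ID" selects the constant string 'ID' instead of the column. Identifiers that need quoting are wrapped in backticks. The helper names quote_identifier and select_distinct below are hypothetical and only illustrate that quoting rule.

```python
def quote_identifier(name):
    """Wrap a column name in backticks for Spark SQL.

    A backtick inside the name is escaped by doubling it.
    """
    return "`" + name.replace("`", "``") + "`"


def select_distinct(column, view):
    """Build a SELECT DISTINCT query with a backtick-quoted column name."""
    return "SELECT DISTINCT {} FROM {}".format(quote_identifier(column), view)


query = select_distinct("ID", "myDataSource")
print(query)  # SELECT DISTINCT `ID` FROM myDataSource
```

Backticks are only required when a column name contains spaces, special characters, or clashes with a keyword; for a plain name such as ID, the unquoted form used above works as well.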
I am trying to get all records from an RDS PostgreSQL database in AWS and load them into an S3 bucket as a CSV file.
The script I am using is:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node JDBC Connection
JDBCConnection_node1 = glueContext.create_dynamic_frame.from_catalog(
database="core",
table_name="core_public_table",
transformation_ctx="JDBCConnection_node1",
)
# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
frame=JDBCConnection_node1,
mappings=[
("sentiment", "string", "sentiment", "string"),
("scheduling", "long", "scheduling", "long"),
],
transformation_ctx="ApplyMapping_node2",
)
# Script generated for node S3 bucket
S3bucket_node3 = glueContext.write_dynamic_frame.from_options(
frame=ApplyMapping_node2,
connection_type="s3",
format="csv",
connection_options={"path": "s3://path_to_s3", "partitionKeys": []},
transformation_ctx="S3bucket_node3",
)
job.commit()
But this job fails with the error "An error occurred while calling getDynamicFrame".
I have read every post about this on Stack Overflow, but I can't solve it. Can anyone please help me with this issue?
I'm trying to convert a dataframe to a Dynamic Frame using the toDF and fromDF functions (https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-fromDF) as per the below code snippet:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "test-3", table_name = "test", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "test-3", table_name = "test", transformation_ctx = "datasource0")
foo = datasource0.toDF()
bar = DynamicFrame.fromDF(foo, glueContext, "bar")
However, I'm getting an error on the line:
bar = DynamicFrame.fromDF(foo, glueContext, "bar")
The error says:
NameError: name 'DynamicFrame' is not defined
I've tried the usual googling to no avail, and I can't see what I've done wrong from other examples. Does anyone know why I'm getting this error and how to resolve it?
from awsglue.dynamicframe import DynamicFrame
Import DynamicFrame
You need to import the DynamicFrame class from the awsglue.dynamicframe module:
from awsglue.dynamicframe import DynamicFrame
There are a lot of things missing in the examples provided with the AWS Glue ETL documentation.
However, you can refer to the following GitHub repository, which contains many examples of basic tasks with Glue ETL:
AWS Glue samples