How to get data from RDS PostgreSQL to S3

I am trying to get all records from an RDS PostgreSQL instance in AWS and load them into an S3 bucket as a CSV file.
The script I am using is:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node JDBC Connection
JDBCConnection_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="core",
    table_name="core_public_table",
    transformation_ctx="JDBCConnection_node1",
)

# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
    frame=JDBCConnection_node1,
    mappings=[
        ("sentiment", "string", "sentiment", "string"),
        ("scheduling", "long", "scheduling", "long"),
    ],
    transformation_ctx="ApplyMapping_node2",
)

# Script generated for node S3 bucket
S3bucket_node3 = glueContext.write_dynamic_frame.from_options(
    frame=ApplyMapping_node2,
    connection_type="s3",
    format="csv",
    connection_options={"path": "s3://path_to_s3", "partitionKeys": []},
    transformation_ctx="S3bucket_node3",
)

job.commit()
But this job is failing with the error "An error occurred while calling getDynamicFrame".
I have read every related post on Stack Overflow but I can't solve it. Can anyone please help me with this issue?
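One way to narrow the failure down (a sketch, not a confirmed fix): read the table over JDBC directly with `from_options`, bypassing the Data Catalog. If the direct read also fails, the problem is connectivity (VPC, security group, credentials); if it works, the catalog table or the attached Glue connection is misconfigured. Every value in `connection_options` below is a placeholder, not taken from the question.

```python
# Sketch: read the PostgreSQL table directly over JDBC, bypassing the
# Data Catalog, to isolate the "getDynamicFrame" failure.
# All values below are placeholders.
connection_options = {
    "url": "jdbc:postgresql://<rds-endpoint>:5432/<database>",
    "dbtable": "<schema>.<table>",
    "user": "<db_user>",
    "password": "<db_password>",
}

# Inside the Glue job, replace the from_catalog call with:
# JDBCConnection_node1 = glueContext.create_dynamic_frame.from_options(
#     connection_type="postgresql",
#     connection_options=connection_options,
#     transformation_ctx="JDBCConnection_node1",
# )
```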

Related

Why is the number of parquet files created by AWS Glue in S3 the same as the total number of records in the MySQL table?

I have a table containing 13 records and a simple Glue job that basically reads from the table and writes to an S3 bucket in parquet format, as shown in the code below. When we execute the job, the number of parquet files produced in S3 is the same as the total number of records we have, so it is writing each row to its own parquet file. I don't understand why this is happening, and why it is not storing all the records in the same parquet file. Is there any config or parameter setting that we have missed?
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node AWS Glue Data Catalog
AWSGlueDataCatalog_node1671552410216 = glueContext.create_dynamic_frame.from_catalog(
    database="sql_database",
    table_name="ob_cpanel_products",
    transformation_ctx="AWSGlueDataCatalog_node1671552410216",
)

# Script generated for node Change Schema (Apply Mapping)
ChangeSchemaApplyMapping_node1671554868239 = ApplyMapping.apply(
    frame=AWSGlueDataCatalog_node1671552410216,
    mappings=[
        ("productid", "int", "productid", "long"),
        ("productcode", "string", "productcode", "string"),
        ("name", "string", "name", "string"),
        ("quantity", "int", "quantity", "long"),
        ("price", "decimal", "price", "decimal"),
    ],
    transformation_ctx="ChangeSchemaApplyMapping_node1671554868239",
)

# Script generated for node Amazon S3
AmazonS3_node1671554880876 = glueContext.write_dynamic_frame.from_options(
    frame=ChangeSchemaApplyMapping_node1671554868239,
    connection_type="s3",
    format="glueparquet",
    connection_options={
        "path": "s3://onebox-glue-etl-pre/etl-tpv/output/with_partition/",
        "partitionKeys": [],
    },
    format_options={"compression": "gzip"},
    transformation_ctx="AmazonS3_node1671554880876",
)

job.commit()
You should repartition the DynamicFrame before writing it; each partition becomes one output file, and the repartitioned frame is what you then pass to write_dynamic_frame:
repartition_dataframe = ChangeSchemaApplyMapping_node1671554868239.repartition(<number of file count>)
# You can also use coalesce instead of repartition
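A sketch of how the file count in the answer above might be chosen; the one-million rows-per-file target is an assumed value for illustration, and the commented Glue calls assume the job context from the question.

```python
# Sketch: derive an output-file count from the row count, then coalesce the
# DynamicFrame to that many partitions before the S3 write. The default
# rows_per_file target is an assumption, not from the question.
def partitions_for(total_rows, rows_per_file=1_000_000):
    """One output file per `rows_per_file` rows, and at least one file."""
    return max(1, -(-total_rows // rows_per_file))  # ceiling division

# In the Glue job, before write_dynamic_frame.from_options(...):
# n = partitions_for(ChangeSchemaApplyMapping_node1671554868239.count())
# output_frame = ChangeSchemaApplyMapping_node1671554868239.coalesce(n)
# ...then pass frame=output_frame to the Amazon S3 sink node.
```

For the 13-record table in the question this yields a single partition, so all rows land in one parquet file.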

Why are JSON files not recognised when copying them from one S3 bucket to another?

I am new to AWS. I have four JSON files in an S3 bucket and I just need to copy them to another S3 bucket.
Below are the JSON files in the source bucket:
02-12.json
03-12.Json
04-12.Json
05-12.Json
When copying into the other bucket I get the results below.
I am using the code below:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node Amazon S3
AmazonS3_node1664415190345 = glueContext.create_dynamic_frame.from_options(
    format_options={"multiline": False},
    connection_type="s3",
    format="json",
    connection_options={
        "paths": ["s3://xxx-xxxx-yy/test/"],
        "recurse": True,
    },
    transformation_ctx="AmazonS3_node1664415190345",
)

# Script generated for node Amazon S3
AmazonS3_node1664415242024 = glueContext.write_dynamic_frame.from_options(
    frame=AmazonS3_node1664415190345,
    connection_type="s3",
    format="json",
    connection_options={
        "path": "s3://xxx-yyyy-www/yyyy/ff/",
        "partitionKeys": [],
    },
    transformation_ctx="AmazonS3_node1664415242024",
)

job.commit()
Can anyone advise what is wrong here?
If you can run the aws CLI, aws s3 cp is the easiest way to copy objects between buckets.
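If the copy has to stay in Python, a boto3 sketch like the one below keeps the original file names, which a Glue S3 sink does not (it rewrites the data into its own part files). The bucket names are the placeholders from the question, and boto3 is imported lazily because the copy itself needs AWS credentials.

```python
# Sketch: copy the .json objects between buckets with boto3 so the original
# file names are preserved. Bucket names are the placeholders from the
# question, not real buckets.

def is_json_key(key):
    """Match .json regardless of case -- the listing has both .json and .Json."""
    return key.lower().endswith(".json")

def copy_json_objects(src_bucket, dst_bucket, prefix=""):
    import boto3  # imported lazily; running this needs AWS credentials
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=src_bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if is_json_key(obj["Key"]):
                s3.copy_object(
                    Bucket=dst_bucket,
                    Key=obj["Key"],
                    CopySource={"Bucket": src_bucket, "Key": obj["Key"]},
                )

# copy_json_objects("xxx-xxxx-yy", "xxx-yyyy-www", prefix="test/")
```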

How to overwrite DynamoDB data using a Glue job

As I understand it, Job Bookmarks prevent duplicated data: "Enable" updates the data based on the previous run, and "Disable" processes the entire dataset (does this mean it overwrites it? I tried this, but the job took far too long and I'm not sure it does what I think it does.)
But what if I want to overwrite the DynamoDB table in the job? I've seen examples where the output data is in S3, but I'm not sure about DynamoDB.
For example I have a Glue job like this:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node Redshift Cluster
RedshiftCluster_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="tr_bbd",
    redshift_tmp_dir=args["TempDir"],
    table_name="tr_bbd_vendor_info",
    transformation_ctx="RedshiftCluster_node1",
)

# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
    frame=RedshiftCluster_node1,
    mappings=[
        ("vendor_code", "string", "vendor_code", "string"),
        ("vendor_group_id", "int", "vendor_group_id", "int"),
        ("vendor_group_status_name", "string", "vendor_group_status_name", "string"),
    ],
    transformation_ctx="ApplyMapping_node2",
)

# Script generated for node DynamoDB bucket
Datasink1 = glueContext.write_dynamic_frame_from_options(
    frame=ApplyMapping_node2,
    connection_type="dynamodb",
    connection_options={
        "dynamodb.output.tableName": "VENDOR_TABLE",
        "dynamodb.throughput.write.percent": "1.0",
    },
)

job.commit()
Thank you.
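One common approach, sketched below as an assumption rather than a confirmed answer: Glue's DynamoDB sink only puts items, so "overwriting" usually means clearing the table first and then letting the job write the fresh data. The table name is the one from the question, the key-attribute name is a placeholder guess, and boto3 is imported lazily because the deletion needs AWS credentials.

```python
# Sketch: empty the DynamoDB table before the Glue job writes, so the job's
# output replaces the old contents. Key attribute names are placeholders.

def primary_key(item, key_names):
    """Project a scanned item down to its primary-key attributes."""
    return {k: item[k] for k in key_names}

def truncate_table(table_name, key_names):
    import boto3  # imported lazily; running this needs AWS credentials
    table = boto3.resource("dynamodb").Table(table_name)
    with table.batch_writer() as batch:
        scan = table.scan(ProjectionExpression=", ".join(key_names))
        while True:
            for item in scan["Items"]:
                batch.delete_item(Key=primary_key(item, key_names))
            if "LastEvaluatedKey" not in scan:
                break
            scan = table.scan(
                ProjectionExpression=", ".join(key_names),
                ExclusiveStartKey=scan["LastEvaluatedKey"],
            )

# truncate_table("VENDOR_TABLE", ["vendor_code"])  # key name is a guess
```

For large tables, deleting and recreating the table is usually cheaper than scanning and deleting every item.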

PySpark code with Spigot function not working in AWS Glue

I am new to AWS Glue. As per the AWS Glue documentation, the Spigot function writes sample records from a DynamicFrame to an S3 directory. But when I run this, it does not create any file under that S3 directory. Any inputs on where I am going wrong? Below is the test code.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "amssurveydb", table_name = "amssurvey", transformation_ctx = "datasource0")
split1 = SplitRows.apply(datasource0, {"count": {">": 50}}, "split11", "split12", transformation_ctx ="split1")## #type: SplitRows
selFromCol1 = SelectFromCollection.apply(dfc = split1, key = "split11", transformation_ctx = "selFromCol1")
selFromCol2 = SelectFromCollection.apply(dfc = split1, key = "split12", transformation_ctx = "selFromCol2")
spigot1 = Spigot.apply(frame = selFromCol1, path = "s3://asgqatestautomation3/SourceFiles/spigot1Op", options = {"topk":5},transformation_ctx ="spigot1")
job.commit()

AWS Glue NameError: name 'DynamicFrame' is not defined

I'm trying to convert a dataframe to a Dynamic Frame using the toDF and fromDF functions (https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-fromDF) as per the below code snippet:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## #type: DataSource
## #args: [database = "test-3", table_name = "test", transformation_ctx = "datasource0"]
## #return: datasource0
## #inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "test-3", table_name = "test", transformation_ctx = "datasource0")
foo = datasource0.toDF()
bar = DynamicFrame.fromDF(foo, glueContext, "bar")
However, I'm getting an error on the line:
bar = DynamicFrame.fromDF(foo, glueContext, "bar")
The error says:
NameError: name 'DynamicFrame' is not defined
I've tried the usual googling to no avail, I can't see what I've done wrong from other examples. Does anyone know why I'm getting this error and how to resolve it?
You need to import the DynamicFrame class from the awsglue.dynamicframe module:
from awsglue.dynamicframe import DynamicFrame
A lot of things are missing from the examples in the AWS Glue ETL documentation. However, you can refer to the following GitHub repository, which contains many examples of basic Glue ETL tasks:
AWS Glue samples