Why parquet files created by AWS GLUE in S3 is the same amount of total records in MYSQL table? - amazon-web-services

I have a table including 13 records, we have a simple GLUE job where it's basically read from table and write to s3 bucket in parquet format as shown in the code in below. When we execute the job, the number of parquet files produced in S3, it's the same as total records we have, so it's writing each row in one parquet file. I don't understand why this is happening, and why it's not storing all the records just in the same parquet file. Is there any config or setting a parameter that we have missed to do?
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node AWS Glue Data Catalog
AWSGlueDataCatalog_node1671552410216 = glueContext.create_dynamic_frame.from_catalog(
database="sql_database",
table_name="ob_cpanel_products",
transformation_ctx="AWSGlueDataCatalog_node1671552410216",
)
# Script generated for node Change Schema (Apply Mapping)
ChangeSchemaApplyMapping_node1671554868239 = ApplyMapping.apply(
frame=AWSGlueDataCatalog_node1671552410216,
mappings=[
("productid", "int", "productid", "long"),
("productcode", "string", "productcode", "string"),
("name", "string", "name", "string"),
("quantity", "int", "quantity", "long"),
("price", "decimal", "price", "decimal"),
],
transformation_ctx="ChangeSchemaApplyMapping_node1671554868239",
)
# Script generated for node Amazon S3
AmazonS3_node1671554880876 = glueContext.write_dynamic_frame.from_options(
frame=ChangeSchemaApplyMapping_node1671554868239,
connection_type="s3",
format="glueparquet",
connection_options={
"path": "s3://onebox-glue-etl-pre/etl-tpv/output/with_partition/",
"partitionKeys": [],
},
format_options={"compression": "gzip"},
transformation_ctx="AmazonS3_node1671554880876",
)
job.commit()

you should try to do the repartition before writing that dataframe
repartition_dataframe = ChangeSchemaApplyMapping_node1671554868239.repartition(<number of file count>)
#You can use coalesce also instead of repartition

Related

Json file is not recognising when copy json files from one s3 bucket to another s3 bucket?

I am new to AWS. I have four json files in S3 bucket. I just need to copy these four JSON files to another S3 bucket.
Below is my JSON files in the S3 bucket
02-12.json
03-12.Json
04-12.Json
05-12.Json
When copying into another bucket I am getting below results
I am using below code
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node Amazon S3
AmazonS3_node1664415190345 = glueContext.create_dynamic_frame.from_options(
format_options={"multiline": False},
connection_type="s3",
format="json",
connection_options={
"paths": ["s3://xxx-xxxx-yy/test/"],
"recurse": True,
},
transformation_ctx="AmazonS3_node1664415190345",
)
# Script generated for node Amazon S3
AmazonS3_node1664415242024 = glueContext.write_dynamic_frame.from_options(
frame=AmazonS3_node1664415190345,
connection_type="s3",
format="json",
connection_options={
"path": "s3://xxx-yyyy-www/yyyy/ff/",
"partitionKeys": [],
},
transformation_ctx="AmazonS3_node1664415242024",
)
job.commit()
Can anyone advise what is wrong here?
if you can exec to the "aws" command, "aws s3 cp" is the easiest way to copy objects between buckets

Glue Job Succeeded but no data inserted into the target bucket

I have used the new AWS Glue Studio visual tool to just try run a very simple SQL query, with Source as a Catalog Table, Transform as a simple SparkSQL, and Target as a CSV file(s) in an s3 bucket.
Each time I run the code, it succeeds but nothing is stored in the bucket, not even an empty CSV file.
Not sure if this is a SparkSQL problem, or an AWS Glue problem.
Here is the automatically generated code :
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue import DynamicFrame
def sparkSqlQuery(glueContext, query, mapping, transformation_ctx) -> DynamicFrame:
for alias, frame in mapping.items():
frame.toDF().createOrReplaceTempView(alias)
result = spark.sql(query)
return DynamicFrame.fromDF(result, glueContext, transformation_ctx)
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node Data_Catalog_0
Data_Catalog_0_node1 = glueContext.create_dynamic_frame.from_catalog(
database="some_long_name_data_base_catalog",
table_name="catalog_table",
transformation_ctx="Data_Catalog_0_node1",
)
# Script generated for node ApplyMapping
SqlQuery0 = """
SELECT DISTINCT "ID"
FROM myDataSource
"""
ApplyMapping_node2 = sparkSqlQuery(
glueContext,
query=SqlQuery0,
mapping={"myDataSource": Data_Catalog_0_node1},
transformation_ctx="ApplyMapping_node2",
)
# Script generated for node Amazon S3
AmazonS3_node166237 = glueContext.write_dynamic_frame.from_options(
frame=ApplyMapping_node2,
connection_type="s3",
format="csv",
connection_options={
"path": "s3://target_bucket/results/",
"partitionKeys": [],
},
transformation_ctx="AmazonS3_node166237",
)
job.commit()
This is very similar to this question, I am kind of reposting it, because I am unable to comment on it due to the low points, and although 4 Months old, still unanswered.
The problem seems to be the double-quotes of the selected fields in the SQL query. Dropping them solved the issue.
In other words, I "wrongly" used this query syntax:
SELECT DISTINCT "ID"
FROM myDataSource
instead of this "correct" one :
SELECT DISTINCT ID
FROM myDataSource
There is no mention of it in the Spark SQL Syntax documentation

How to override dynamoDB data using Glue job

As I understand, Joob Bookmarks prevents the duplicated data. "Enable" updates the data based on the previous data, and "disable" process the entire dataset (does this mean it overrides it? I tried this, but the job took for too long and i'm not sure if it does what i think it does.)
But what if I want to override the Dynamodb Table in the job? I've seen examples where the output data is in S3, but I'm not sure about the DynamoDB.
For example I have a Glue job like this:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node Redshift Cluster
RedshiftCluster_node1 = glueContext.create_dynamic_frame.from_catalog(
database="tr_bbd",
redshift_tmp_dir=args["TempDir"],
table_name="tr_bbd_vendor_info",
transformation_ctx="RedshiftCluster_node1",
)
# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
frame=RedshiftCluster_node1,
mappings=[
("vendor_code", "string", "vendor_code", "string"),
("vendor_group_id", "int", "vendor_group_id", "int"),
("vendor_group_status_name", "string", "vendor_group_status_name", "string")
],
transformation_ctx="ApplyMapping_node2",
)
# Script generated for node DynamoDB bucket
Datasink1 = glueContext.write_dynamic_frame_from_options(
frame=ApplyMapping_node2,
connection_type="dynamodb",
connection_options={
"dynamodb.output.tableName": "VENDOR_TABLE",
"dynamodb.throughput.write.percent": "1.0"
}
)
job.commit()
Thank you.

Partition colum from AWS Glue Catalog becomes null after mapping

Im running a job on AWS Glue to create another catalog(B) from one previously made(A). The main problem here, both on Visual(preview) and Pyspark script is when I try to use the partitions from A, which are year, month and day. After the AppyMapping process these columns become null completely.
Visual example
And if I run the job trying to partition B through the year provided by A it shows HIVE_DEFAULT_PARTITION on S3.
I share with you the code so far.
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node S3 bucket
S3bucket_node1 = glueContext.create_dynamic_frame.from_catalog(
database="clean",
table_name="sumn",
transformation_ctx="S3bucket_node1",
)
# Script generated for node Apply Mapping
ApplyMapping_node1660246155493 = ApplyMapping.apply(
frame=S3bucket_node1,
mappings=[
("date", "date", "date", "date"),
("producto", "string", "producto", "string"),
("volumennominacion", "double", "volumennominacion", "double"),
("estacion", "string", "estacion", "string"),
("proveedor", "string", "proveedor", "string"),
("preciocompraporlitro", "double", "preciocompraporlitro", "double"),
("precioventaporlitro", "double", "precioventaporlitro", "double"),
("year", "int", "year", "int"),
("month", "int", "month", "int"),
("day", "int", "day", "int"),
],
transformation_ctx="ApplyMapping_node1660246155493",
)
job.commit()
Thanks.

How to get data from RDS postgresql to S3

I am trying to get all records from a RDS postgresql in AWS and load than in a S3 bucket as csv file.
The script i am using is:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node JDBC Connection
JDBCConnection_node1 = glueContext.create_dynamic_frame.from_catalog(
database="core",
table_name="core_public_table",
transformation_ctx="JDBCConnection_node1",
)
# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
frame=JDBCConnection_node1,
mappings=[
("sentiment", "string", "sentiment", "string"),
("scheduling", "long", "scheduling", "long"),
],
transformation_ctx="ApplyMapping_node2",
)
# Script generated for node S3 bucket
S3bucket_node3 = glueContext.write_dynamic_frame.from_options(
frame=ApplyMapping_node2,
connection_type="s3",
format="csv",
connection_options={"path": "s3://path_to_s3", "partitionKeys": []},
transformation_ctx="S3bucket_node3",
)
job.commit()
But this job is failing due to An error occurred while calling getDynamicFrame
I read every post about in stackoverflow but I can't solve it. Can anyone please help me with this issue?