I'm very new to this, so I'm not sure whether this script could be simplified or whether I'm doing something wrong that's causing this behavior. I've written an ETL script for AWS Glue that writes to a directory within an S3 bucket.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# catalog: database and table names
db_name = "events"
tbl_base_event_info = "base_event_info"
tbl_event_details = "event_details"
# output directories
output_dir = "s3://whatever/output"
# create dynamic frames from source tables
base_event_source = glueContext.create_dynamic_frame.from_catalog(database = db_name, table_name = tbl_base_event_info)
event_details_source = glueContext.create_dynamic_frame.from_catalog(database = db_name, table_name = tbl_event_details)
# join frames
base_event_source_df = base_event_source.toDF()
event_details_source_df = event_details_source.toDF()
enriched_event_df = base_event_source_df.join(event_details_source_df, "event_id")
enriched_event = DynamicFrame.fromDF(enriched_event_df, glueContext, "enriched_event")
# write frame to json files
datasink = glueContext.write_dynamic_frame.from_options(frame = enriched_event, connection_type = "s3", connection_options = {"path": output_dir}, format = "json")
job.commit()
The base_event_info table has 4 columns: event_id, event_name, platform, client_info
The event_details table has 2 columns: event_id, event_details
The joined table schema should look like: event_id, event_name, platform, client_info, event_details
After I run this job, I expected to get 2 JSON files, since that's how many records are in the resulting joined table (there are two records in the tables with the same event_id). However, what I actually get is about 200 files in the form of run-1540321737719-part-r-00000, run-1540321737719-part-r-00001, etc.:
198 files contain 0 bytes
2 files contain 250 bytes (each with the correct info corresponding to the enriched events)
Is this the expected behavior? Why is this job generating so many empty files? Is there something wrong with my script?
The Spark SQL module contains the following default configuration:
spark.sql.shuffle.partitions is set to 200.
That's why you are getting 200 files in the first place.
You can check if this is the case by doing the following:
enriched_event_df.rdd.getNumPartitions()
If you get a value of 200, you can change it to the number of files you want to generate with the following code:
enriched_event_df = enriched_event_df.repartition(2)
Note that repartition returns a new DataFrame, so the result has to be assigned back before converting it to a DynamicFrame. The above code will create only two files with your data.
In my experience empty output files point to an error in transformations.
You can debug these using the error functions.
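For example, a minimal sketch of checking those error functions on the enriched_event frame from your script:
# Inspect error records accumulated by the DynamicFrame transformations
print(enriched_event.errorsCount())         # total number of error records
print(enriched_event.stageErrorsCount())    # errors from the transformation that created this frame
enriched_event.errorsAsDynamicFrame().toDF().show()  # look at the error records themselves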
By the way, why are you doing the join using Spark DataFrames instead of DynamicFrames?
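For reference, a rough DynamicFrame-only sketch of the same join using the built-in Join transform (hedged: how Glue names the duplicate key column in the result may vary):
from awsglue.transforms import Join
# Equijoin the two source frames on event_id without converting to DataFrames
enriched_event = Join.apply(base_event_source, event_details_source, 'event_id', 'event_id')
# If the joined frame ends up with a duplicated key column, drop it with drop_fields([...])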
Instead of repartitioning, you can add a column such as a timestamp to the DataFrame through a Spark SQL transformation step and use it as a partition key while writing the DataFrame to S3.
For example:
select replace(replace(replace(string(date_trunc('HOUR',current_timestamp())),'-',''),':',''),' ','') as datasetdate, * from myDataSource;
Use datasetdate as the partition key while writing the DynamicFrame; the Glue job should be able to add the partitions automatically.
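A hedged PySpark sketch of that approach, reusing the names from the question's script (the date_format pattern below mirrors the SQL above and is an assumption):
from pyspark.sql.functions import current_timestamp, date_trunc, date_format
from awsglue.dynamicframe import DynamicFrame
# Add a datasetdate column truncated to the hour, then use it as the partition key
stamped_df = enriched_event_df.withColumn(
    "datasetdate",
    date_format(date_trunc("HOUR", current_timestamp()), "yyyyMMddHHmmss"))
stamped = DynamicFrame.fromDF(stamped_df, glueContext, "stamped")
glueContext.write_dynamic_frame.from_options(
    frame=stamped,
    connection_type="s3",
    connection_options={"path": output_dir, "partitionKeys": ["datasetdate"]},
    format="json")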
Hi, I am working with AWS Glue and Spark. I am grabbing the data from a DynamoDB table and creating a DynamicFrame from it. I want to be able to send all the data from that table, record by record, to SQS. I have seen others suggest converting the DynamicFrame to a Spark DataFrame, but this is going to be a table with millions of records and converting to a DataFrame could take a while. I want to be able to just send all the records in the DynamicFrame over to the SQS queue.
Here is my code:
import sys
import boto3
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
sqs = boto3.resource('sqs')
sqs_queue_url = f"https://sqs.us-east-1.amazonaws.com/{account_id}/my-stream-queue"
queue = sqs.Queue(sqs_queue_url)
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
job = Job(glueContext)
## #params: [JOB_NAME]
job.init(args['JOB_NAME'], args)
logger = glueContext.get_logger()
df = glueContext.create_dynamic_frame.from_options("dynamodb",
    connection_options={
        "dynamodb.input.tableName": "my_table",
        "dynamodb.throughput.read.percent": "1.5",
        "dynamodb.splits": "500"
    },
    numSlots=2368)
job.commit()
# iterate over dynamic frame and send each record over the sqs queue
for record in df:
    queue.send_message(MessageBody=record)
I am doing something very similar. Here is what I discovered:
datasource0 = glueContext.create_dynamic_frame.from_catalog(
database="athena",
table_name=str(args['value']),
transformation_ctx="datasource0")
job.commit()
df = datasource0.toDF()
pandasDF = df.toPandas()
for index, row in pandasDF.iterrows():
    message_body = generate_message(row['bucket'], row['key'], row['version_id'])
    send_message(sqs_queue, json.loads(json.dumps(message_body)))
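If collecting everything through pandas is a concern at millions of rows, one hedged alternative sketch (not from the answer above) is to send from the executors with foreachPartition; the region and the JSON message format here are assumptions:
import json
import boto3
def send_partition(rows):
    # Create the client inside the partition function; boto3 clients are not serializable
    sqs_client = boto3.client("sqs", region_name="us-east-1")
    for row in rows:
        sqs_client.send_message(QueueUrl=sqs_queue_url,
                                MessageBody=json.dumps(row.asDict(), default=str))
# df is the DynamicFrame from the question; the rows are sent from the executors,
# so nothing has to be collected on the driver first
df.toDF().foreachPartition(send_partition)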
I'm trying to test out some Glue functionality, and the push down predicate is not working on Avro files within S3 that were partitioned for use in Hive. Our partitions are as follows: YYYY-MM-DD.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
filterpred = "loaddate == '2019-08-08'"
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "hive",
table_name = "stuff",
pushDownPredicate = filterpred)
print ('############################################')
print "COUNT: ", datasource0.count()
print ('##############################################')
df = datasource0.toDF()
df.show(5)
job.commit()
However, I still see Glue pulling in dates way outside of the range:
Opening 's3://data/2018-11-29/part-00000-a58ee9cb-c82c-46e6-9657-85b4ead2927d-c000.avro' for reading
2019-09-13 13:47:47,071 INFO [Executor task launch worker for task 258] s3n.S3NativeFileSystem (S3NativeFileSystem.java:open(1208)) -
Opening 's3://data/2017-09-28/part-00000-53c07db9-05d7-4032-aa73-01e239f509cf.avro' for reading
I tried using the examples in the following:
AWS Glue DynamicFrames and Push Down Predicate
AWS Glue pushdown predicate not working properly
And currently none of the solutions proposed are working for me. I tried adding the partition column (loaddate), taking it out, quoting, unquoting, etc. It still grabs data outside of the date range.
There is an error in your code: the correct parameter to pass to the from_catalog function is push_down_predicate, not pushDownPredicate.
Sample snippet :
datasource0 = glueContext.create_dynamic_frame.from_catalog(
database = "hive",
table_name = "stuff",
push_down_predicate = filterpred)
Reference AWS Documentation : https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html
It seems like your partitions are not in the Hive naming style (key=value), so you have to use the default name partition_0 in the query. Also, as suggested in another answer, the parameter is called push_down_predicate:
filterpred = "partition_0 == '2019-08-08'"
datasource0 = glue_context.create_dynamic_frame.from_catalog(
database = "hive",
table_name = "stuff",
push_down_predicate = filterpred)
Make sure your data is partitioned properly and run a Glue crawler to create the partitioned table.
Run a query in Athena to repair your table:
MSCK REPAIR TABLE tbl;
Run a query in Athena to check the partitions:
SHOW PARTITIONS tbl;
In Scala you can use the following code.
Without predicate:
val datasource0 = glueContext.getCatalogSource(database = "ny_taxi_db", tableName = "taxi_tbl", redshiftTmpDir = "", transformationContext = "datasource0").getDynamicFrame()
datasource0.toDF().count()
With predicate:
val predicate = "(year == '2016' and year_month == '201601' and year_month_day == '20160114')"
val datasource1 = glueContext.getCatalogSource(database = "ny_taxi_db", tableName = "taxi_tbl", transformationContext = "datasource1", pushDownPredicate = predicate).getDynamicFrame()
datasource1.toDF().count()
In Python you can use the following code.
Without predicate:
ds = glueContext.create_dynamic_frame.from_catalog(database = "ny_taxi_db", table_name = "taxi_data_by_vender", transformation_ctx = "datasource0")
ds.toDF().count()
With predicate:
ds1 = glueContext.create_dynamic_frame.from_catalog(database = "ny_taxi_db" , table_name = "taxi_data_by_vender", transformation_ctx = "datasource1" , push_down_predicate = "(vendorid == 1)")
ds1.toDF().count()
I have a self authored Glue script and a JDBC Connection stored in the Glue catalog. I cannot figure out how to use PySpark to do a select statement from the MySQL database stored in RDS that my JDBC Connection points to. I have also used a Glue Crawler to infer the schema of the RDS table that I am interested in querying. How do I query the RDS database using a WHERE clause?
I have looked through the documentation for DynamicFrameReader and the GlueContext class, but neither seems to point me in the direction that I am seeking.
It depends on what you want to do. For example, if you want to do a select * from table where <conditions>, there are two options:
Assuming you created a crawler and added the source to your AWS Glue job like this:
# Read data from database
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "db", table_name = "students", redshift_tmp_dir = args["TempDir"])
Option 1: AWS Glue transforms only
# Select the needed fields
selectfields1 = SelectFields.apply(frame = datasource0, paths = ["user_id", "full_name", "is_active", "org_id", "org_name", "institution_id", "department_id"], transformation_ctx = "selectfields1")
filter2 = Filter.apply(frame = selectfields1, f = lambda x: x["org_id"] in org_ids, transformation_ctx="filter2")
Option 2: PySpark + AWS Glue
# Change DynamicFrame to Spark DataFrame
dataframe = DynamicFrame.toDF(datasource0)
# Create a view
dataframe.createOrReplaceTempView("students")
# Use SparkSQL to select the fields
dataframe_sql_df_dim = spark.sql("SELECT user_id, full_name, is_active, org_id, org_name, institution_id, department_id FROM students WHERE org_id in (" + org_ids + ")")
# Change back to DynamicFrame
selectfields = DynamicFrame.fromDF(dataframe_sql_df_dim, glueContext, "selectfields2")
How do I add a current timestamp (extra column) in the Glue job so that the output data has an extra column? In this case:
Schema Source Table:
Col1, Col2
After the Glue job, the schema of the destination should be:
Col1, Col2, Update_Date(Current Timestamp)
We do the following, and it works great without converting to a DataFrame with toDF():
datasource0 = glueContext.create_dynamic_frame.from_catalog(...)
from datetime import datetime
def AddProcessedTime(r):
    r["jobProcessedDateTime"] = datetime.today()  # timestamp of when we ran this
    return r
mapped_dyF = Map.apply(frame = datasource0, f = AddProcessedTime)
I'm not sure if there's a Glue-native way to do this with the DynamicFrame, but you can easily convert to a Spark DataFrame and then use the withColumn method. You will need to use the lit function to put literal values into the new column, as below.
from datetime import datetime
from pyspark.sql.functions import lit
glue_df = glueContext.create_dynamic_frame.from_catalog(...)
spark_df = glue_df.toDF()
spark_df = spark_df.withColumn('some_date', lit(datetime.now()))
Some references:
Glue DynamicFrame toDF()
Spark Dataframe withColumn()
In my experience working with Glue, the timezone where Glue runs is GMT, but my timezone is CDT. So, to get the CDT timezone I need to convert the time within the SparkContext. This specific case is about adding last_load_date to the target/sink.
So I created a function.
from datetime import datetime as dt
from pyspark.sql import SQLContext
from pyspark.sql.functions import from_utc_timestamp, lit  # lit is used below when adding the column
def convert_timezone(sc):
    sqlContext = SQLContext(sc)
    local_time = dt.now().strftime('%Y-%m-%d %H:%M:%S')
    local_time_df = sqlContext.createDataFrame([(local_time,)], ['time'])
    CDT_time_df = local_time_df.select(from_utc_timestamp(local_time_df['time'], 'CST6CDT').alias('cdt_time'))
    CDT_time = [i['cdt_time'].strftime('%Y-%m-%d %H:%M:%S') for i in CDT_time_df.collect()][0]
    return CDT_time
And then call the function like ...
job_run_time = date_config.convert_timezone(sc)
datasourceDF0 = datasource0.toDF()
datasourceDF1 = datasourceDF0.withColumn('last_updated_date',lit(job_run_time))
Since I haven't seen a proper answer to this issue, I will try to explain my solution to this problem.
First, to clarify: the withColumn function is a good way to do this, but it is important to mention that this function belongs to Spark's own DataFrame and is not part of the Glue DynamicFrame, which is AWS Glue's own library, so you need to convert between the frames to do this.
The first step is to get the Spark DataFrame from the DynamicFrame; the Glue library does this with the toDF() function. Once you have the Spark DataFrame, you can add the column and/or do whatever manipulation you require.
Then, since Glue expects its own frame type, we need to transform back from the Spark DataFrame to Glue's DynamicFrame. To do so you can use the apply function of the DynamicFrame, which requires importing the object:
import com.amazonaws.services.glue.DynamicFrame
and use the glueContext, which you should already have, like:
DynamicFrame(sparkDataFrame, glueContext)
In summary, the code should look like this:
import org.apache.spark.sql.functions._
import com.amazonaws.services.glue.DynamicFrame
...
val sparkDataFrame = datasourceToModify.toDF().withColumn("created_date", current_date())
val finalDataFrameForGlue = DynamicFrame(sparkDataFrame, glueContext)
...
Note: the import org.apache.spark.sql.functions._ brings in the current_date() function used to add the column with the date.
Hope this helps.
Use Spark's current_timestamp() function:
import org.apache.spark.sql.functions._
...
val timestampedDf = source.toDF().withColumn("Update_Date", current_timestamp())
val timestamped = DynamicFrame(timestampedDf, glueContext)
You can supposedly do this with built-in functionality now; look for the glueContext.add_ingestion_time_columns section in the Glue documentation.
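A short hedged sketch of how that built-in might be used (datasource0 and glueContext are assumed to exist as in the earlier snippets; with "day" granularity it is expected to append ingest_year/ingest_month/ingest_day columns):
from awsglue.dynamicframe import DynamicFrame
# add_ingestion_time_columns operates on a Spark DataFrame, so convert, stamp, convert back
source_df = datasource0.toDF()
stamped_df = glueContext.add_ingestion_time_columns(source_df, "day")
stamped_dyf = DynamicFrame.fromDF(stamped_df, glueContext, "stamped_dyf")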
In AWS Glue's catalog, I have an external table defined with partitions that looks roughly like this in S3 and partitions for new dates are added daily:
s3://my-data-lake/test-table/
2017/01/01/
part-0000-blah.csv.gz
.
.
part-8000-blah.csv.gz
2017/01/02/
part-0000-blah.csv.gz
.
.
part-7666-blah.csv.gz
How could I use Glue/Spark to convert this to Parquet that is also partitioned by date and split across n files per day? The examples don't cover partitioning, splitting, or provisioning (how many nodes and how big). Each day contains a couple hundred GBs.
Because the source CSVs are not necessarily in the right partitions (wrong date) and are inconsistent in size, I'm hoping to write to partitioned Parquet with the right partitions and a more consistent size.
Since the source CSV files are not necessarily in the right date, you could add to them additional information regarding collect date time (or use any date if already available):
{"collectDateTime": {
"timestamp": 1518091828,
"timestampMs": 1518091828116,
"day": 8,
"month": 2,
"year": 2018
}}
Then your job could use this information in the output DynamicFrame and ultimately use them as partitions. Some sample code of how to achieve this:
from awsglue.transforms import *
from pyspark.sql.types import *
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
import sys
import datetime
###
# CREATE THE NEW SIMPLIFIED LINE
##
def create_simplified_line(event_dict):
    # collect date time
    collect_date_time_dict = event_dict["collectDateTime"]
    new_line = {
        # TODO: COPY YOUR DATA HERE
        "myData": event_dict["myData"],
        "someOtherData": event_dict["someOtherData"],
        "timestamp": collect_date_time_dict["timestamp"],
        "timestampmilliseconds": int(collect_date_time_dict["timestamp"]) * 1000,
        "year": collect_date_time_dict["year"],
        "month": collect_date_time_dict["month"],
        "day": collect_date_time_dict["day"]
    }
    return new_line
###
# MAIN FUNCTION
##
# context
glueContext = GlueContext(SparkContext.getOrCreate())
# fetch from previous day source bucket
previous_date = datetime.datetime.utcnow() - datetime.timedelta(days=1)
# build s3 paths
s3_path = "s3://source-bucket/path/year={}/month={}/day={}/".format(previous_date.year, previous_date.month, previous_date.day)
# create dynamic_frame
dynamic_frame = glueContext.create_dynamic_frame.from_options(connection_type="s3", connection_options={"paths": [s3_path]}, format="json", format_options={}, transformation_ctx="dynamic_frame")
# resolve choices (optional)
dynamic_frame_resolved = ResolveChoice.apply(frame=dynamic_frame,choice="project:double",transformation_ctx="dynamic_frame_resolved")
# transform the source dynamic frame into a simplified version
result_frame = Map.apply(frame=dynamic_frame_resolved, f=create_simplified_line)
# write to simple storage service in parquet format
glueContext.write_dynamic_frame.from_options(frame=result_frame, connection_type="s3", connection_options={"path":"s3://target-bucket/path/","partitionKeys":["year", "month", "day"]}, format="parquet")
Did not test it, but the script is just a sample of how to achieve this and is fairly straightforward.
UPDATE
1) As for having specific file sizes/numbers in output partitions:
Spark's coalesce and repartition features are not implemented in Glue's Python API (at the time of writing, only in Scala).
You can convert your dynamic frame into a data frame and leverage Spark's partition capabilities.
Convert to a DataFrame and repartition it (here into a single partition):
partitioned_dataframe = datasource0.toDF().repartition(1)
Convert back to a DynamicFrame for further processing.
partitioned_dynamicframe = DynamicFrame.fromDF(partitioned_dataframe, glueContext, "partitioned_df")
The good news is that Glue has an interesting feature: if you have more than 50,000 input files per partition, it will automatically group them for you.
In case you want to set this behavior explicitly regardless of the number of input files (your case), you may set the following connection_options while creating a dynamic frame from options:
dynamic_frame = glueContext.create_dynamic_frame.from_options(connection_type="s3", connection_options={"paths": [s3_path], 'groupFiles': 'inPartition', 'groupSize': 1024 * 1024}, format="json", format_options={}, transformation_ctx="dynamic_frame")
In the previous example, it would attempt to group files into 1MB groups.
It is worth mentioning that this is not the same as coalesce, but it may help if your goal is to reduce the number of files per partition.
2) If files already exist in the destination, will it just safely add to them (not overwrite or delete)?
Glue's default SaveMode for write_dynamic_frame.from_options is to append.
When saving a DataFrame to a data source, if data/table already
exists, contents of the DataFrame are expected to be appended to
existing data.
3) Given each source partition may be 30-100 GB, what's a guideline for the number of DPUs?
I'm afraid I won't be able to answer that. It depends on how fast it'll load your input files (size/number), your script's transformations, etc.
Import the datetime library
import datetime
Split the timestamp based on partition conditions
now = datetime.datetime.now()
year = str(now.year)
month = str(now.month)
day = str(now.day)
currdate = "s3://Destination/" + year + "/" + month + "/" + day
Add the variable currdate to the path in the writer, as sketched below. The results will be partitioned Parquet files.
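For instance, a hedged sketch of that write step (my_dynamic_frame is a hypothetical placeholder for whatever DynamicFrame your job produced; only currdate comes from the code above):
# Write the frame to the date-based S3 path built above
glueContext.write_dynamic_frame.from_options(
    frame=my_dynamic_frame,
    connection_type="s3",
    connection_options={"path": currdate},
    format="parquet")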