Not able to populate AWS Glue ETL Job metrics - amazon-web-services

I am trying to populate as many Glue job metrics as possible for some testing. Below is the setup I have created:
A crawler reads data (dummy customer data of 500 rows) from a CSV file placed in an S3 bucket.
Another crawler crawls the tables created in a Redshift cluster.
Finally, an ETL job reads the data from the CSV file in S3 and dumps it into a Redshift table.
The job runs without any issue and I can see the final data landing in the Redshift table; however, only the following 5 CloudWatch metrics are being populated:
glue.jvm.heap.usage
glue.jvm.heap.used
glue.s3.filesystem.read_bytes
glue.s3.filesystem.write_bytes
glue.system.cpuSystemLoad
There are approximately 20 more metrics which are not getting populated.
Any suggestions on how to populate those remaining metrics as well?

Double-check that the CloudWatch metrics for your job are enabled.
Make sure your job runs longer, say > 3 minutes, so that CloudWatch has time to receive the pushed metrics.
For this you can add a sleep in your code (see the sketch below).
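A minimal sketch of that padding, assuming the usual PySpark Glue script boilerplate (the 180-second sleep is just an illustrative value):

```python
# Minimal sketch, assuming a standard PySpark Glue job script; the sleep only
# keeps the run above the ~3 minute mark so metrics get pushed.
import sys
import time

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# ... existing ETL logic: read the CSV from S3, write to Redshift ...

time.sleep(180)  # pad the runtime so CloudWatch has time to receive the metrics
job.commit()
```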
Assuming that you are using Glue version 2.0+ for the above job, be advised that AWS Glue 2.0+ does not use dynamic allocation, hence the ExecutorAllocationManager metrics are not available. Switch back to Glue 1.0 and you should see that all the documented metrics become available.

I met the same issue. Do your glue.s3.filesystem.read_bytes and glue.s3.filesystem.write_bytes have any data?
One possible reason is that AWS Glue job metrics are not emitted if the job completes in less than 30 seconds.

When running the job, enable the job metrics option under the Monitoring tab.
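The same switch can also be applied outside the console via the --enable-metrics special job parameter. A hedged boto3 sketch (the job name is a placeholder):

```python
# Sketch: pass the --enable-metrics special parameter when starting a run.
import boto3

glue = boto3.client("glue")
run = glue.start_job_run(
    JobName="csv-to-redshift-job",           # hypothetical job name
    Arguments={"--enable-metrics": "true"},  # presence of this key enables job metrics
)
print(run["JobRunId"])
```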

Related

How to do a delta (incremental) load in AWS and a full data refresh in a dashboard?

I have multiple ERPs ingesting data into S3, and I have AWS Glue for Spark processing.
I found out that I need Delta-type files for Spark processing, and that the best way to run this ETL is on EMR or Databricks.
Should I go with Databricks for the incremental load and the full refresh of the dashboard?
Or can EMR also manage a full data refresh along with update-matched-and-insert-new-data (upsert) features? If yes, please share some info.
What I am confused about is: if I only have new/updated/deleted data to process, how will the dashboard show me all the previous data?

AWS Data Pipeline: dump data to 3 S3 nodes

I have a use case wherein I want to take data from DynamoDB and do some transformation on it. After this I want to create 3 CSV files (there will be 3 transformations on the same data) and dump them into 3 different S3 locations.
My architecture would be roughly the following:
Is it possible to do so? I can't seem to find any documentation regarding it. If it's not possible using Data Pipeline, are there any other services which could help me with my use case?
These dumps will be scheduled daily. My other consideration was using AWS Lambda, but from my understanding it is triggered by events rather than on a time-based schedule; is that correct?
Yes, it is possible, but not using HiveActivity; use EMRActivity instead. If you look at the Data Pipeline documentation for HiveActivity, it clearly states its purpose, which does not suit your use case:
Runs a Hive query on an EMR cluster. HiveActivity makes it easier to set up an Amazon EMR activity and automatically creates Hive tables based on input data coming in from either Amazon S3 or Amazon RDS. All you need to specify is the HiveQL to run on the source data. AWS Data Pipeline automatically creates Hive tables with ${input1}, ${input2}, and so on, based on the input fields in the HiveActivity object.
Below is how your data pipeline should look. There is also a built-in template, Export DynamoDB table to S3, in the AWS Data Pipeline UI which creates the basic structure for you; you can then extend/customize it to suit your requirements.
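To make the structure concrete, here is a hedged, structure-only sketch of registering such a definition with boto3; the IDs, S3 paths, and EMR step are placeholders, and a real pipeline would also need a Schedule object and an EmrCluster resource to run on:

```python
# Structure-only sketch: one EmrActivity writing to three S3DataNode outputs.
# Not a complete, validated pipeline definition.
import boto3

dp = boto3.client("datapipeline")
pipeline_id = dp.create_pipeline(name="ddb-to-3-s3", uniqueId="ddb-to-3-s3")["pipelineId"]

def s3_node(node_id, path):
    return {"id": node_id, "name": node_id,
            "fields": [{"key": "type", "stringValue": "S3DataNode"},
                       {"key": "directoryPath", "stringValue": path}]}

objects = [
    s3_node("Out1", "s3://my-bucket/transform-1/"),
    s3_node("Out2", "s3://my-bucket/transform-2/"),
    s3_node("Out3", "s3://my-bucket/transform-3/"),
    {"id": "Transform", "name": "Transform",
     "fields": [
         {"key": "type", "stringValue": "EmrActivity"},
         # hypothetical Spark step that reads DynamoDB and writes the 3 outputs
         {"key": "step", "stringValue": "command-runner.jar,spark-submit,s3://my-bucket/scripts/transform.py"},
         {"key": "output", "refValue": "Out1"},
         {"key": "output", "refValue": "Out2"},
         {"key": "output", "refValue": "Out3"},
     ]},
]
dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=objects)
```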
To your next question about using Lambda: of course Lambda can be configured with event-based or schedule-based triggering, but I wouldn't recommend AWS Lambda for any ETL operations, since Lambda executions are time-bound and typical ETLs run longer than the Lambda time limit.
AWS has purpose-built offerings for ETL, AWS Data Pipeline and AWS Glue, and I would always recommend choosing between those two. If your ETL involves data sources not managed within AWS compute and storage services, or some specialty use case that can't be satisfied by the above two options, then AWS Batch would be my next consideration.
Thanks amith for your answer. I have been busy for quite some time. I did some digging after you posted your answer, and it turns out we can dump the data to different S3 locations using HiveActivity as well.
This is how the data pipeline would look in that case.
But I believe writing multiple Hive activities, when your input source is a DynamoDB table, is not a good idea, since Hive doesn't load any data into memory. It does all the computations against the actual table, which could degrade the table's performance. Even the documentation suggests exporting the data in case you need to make multiple queries against the same data. Reference:
Enter a Hive command that maps a table in the Hive application to the data in DynamoDB. This table acts as a reference to the data stored in Amazon DynamoDB; the data is not stored locally in Hive and any queries using this table run against the live data in DynamoDB, consuming the table’s read or write capacity every time a command is run. If you expect to run multiple Hive commands against the same dataset, consider exporting it first.
In my case I needed to perform different types of aggregations on the same data once a day. Since DynamoDB doesn't support aggregations, I turned to Data Pipeline using Hive. In the end we ended up using Amazon Aurora, which is MySQL-based.

Amazon S3 daily data processing

Data is being stored in an S3 bucket on a daily basis, and we are trying to automate the parsing and processing of that daily data. We already have the script that will parse the data; we just need an approach on AWS to automate it. The approach/use case we thought of was AWS Batch, scheduled to run the script daily (or to pick up the latest data for that day before EOD), but it seems Batch is not capable of doing this.
Any ideas or approaches? We've seen some approaches using Lambda and SQS/SNS.
Just to summarize:
data (daily) > stored in S3 > data will be processed by our team > stored to Elasticsearch.
Thanks for your ideas.
AWS Lambda is exactly what you want in this case. You can trigger a Lambda execution when an S3 file shows up; it will process the file and can then send it to Elasticsearch or wherever you want it to end up.
Here's an official explanation from AWS: https://docs.aws.amazon.com/lambda/latest/dg/with-s3.html
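A minimal sketch of such a handler, with the team's parsing script standing behind a hypothetical parse_and_index() helper:

```python
# Sketch of an S3-triggered Lambda: fetch the new object and hand it to the
# existing parsing logic; parse_and_index() is a hypothetical stand-in.
import boto3

s3 = boto3.client("s3")

def parse_and_index(raw_bytes):
    """Hypothetical stand-in for the existing parsing script + Elasticsearch push."""
    ...

def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        parse_and_index(body)
```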
You can use Lambda + CloudWatch Events to execute your code on a regular schedule. You can specify a fixed rate (or a cron expression); for example, in your case, you could execute your Lambda every 24 hours so that your data-processing logic runs once daily.
Take a look at this article from AWS: Schedule AWS Lambda Functions Using CloudWatch Events
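A hedged sketch of wiring that up with boto3; the rule name, function name, and the account/region in the ARN are placeholders, and the Lambda additionally needs a resource-based permission allowing events.amazonaws.com to invoke it:

```python
# Sketch: create a daily CloudWatch Events (EventBridge) rule and point it at
# the processing Lambda. Names and the ARN are placeholders.
import boto3

events = boto3.client("events")

events.put_rule(
    Name="daily-s3-processing",
    ScheduleExpression="cron(0 23 * * ? *)",  # every day at 23:00 UTC
    State="ENABLED",
)
events.put_targets(
    Rule="daily-s3-processing",
    Targets=[{
        "Id": "daily-s3-processing-target",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:process-daily-data",
    }],
)
```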

Can AWS Athena queries be run periodically (i.e., on a schedule)?

Is there any support for running Athena queries on a schedule? We want to query some data daily, and dump a summarized CSV file, but it would be best if this happened on an automated schedule.
Schedule an AWS Lambda task to kick this off, or use a cron job on one of your servers.
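A hedged sketch of the Lambda side of that; the query, database, and output bucket are placeholders, and Athena writes the result as a CSV under the OutputLocation you give it:

```python
# Sketch of a scheduled Lambda handler that kicks off the daily summary query.
import boto3

athena = boto3.client("athena")

def lambda_handler(event, context):
    response = athena.start_query_execution(
        QueryString="SELECT day, count(*) AS events FROM logs GROUP BY day",  # placeholder query
        QueryExecutionContext={"Database": "analytics"},                      # placeholder database
        ResultConfiguration={"OutputLocation": "s3://my-bucket/daily-summary/"},
    )
    return response["QueryExecutionId"]
```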

Periodically moving query results from Redshift to S3 bucket

I have my data in a table in a Redshift cluster. I want to periodically run a query against the Redshift table and store the results in an S3 bucket.
I will then run some data transformations on the data in the S3 bucket to feed it into another system. According to the AWS documentation I can use the UNLOAD command, but is there a way to schedule this periodically? I have searched a lot but haven't found any relevant information on this.
You can use a scheduling tool like Airflow to accomplish this task. Airflow connects seamlessly to Redshift and S3. You can have a DAG action which polls Redshift periodically and unloads the data from Redshift onto S3.
I don't believe Redshift has the ability to schedule queries periodically. You would need to use another service for this. You could use a Lambda function, or you could schedule a cron job on an EC2 instance.
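Whichever scheduler you choose, the export itself is just an UNLOAD statement run over a normal Redshift connection. A hedged sketch with psycopg2; the host, credentials, query, bucket, and IAM role are placeholders:

```python
# Sketch of the UNLOAD step, runnable from a scheduled Lambda or a cron job.
import psycopg2

UNLOAD_SQL = """
    UNLOAD ('SELECT * FROM public.events')
    TO 's3://my-bucket/redshift-exports/events_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload-role'
    CSV HEADER ALLOWOVERWRITE;
"""

def run_unload():
    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="analytics",
        user="unload_user",
        password="...",  # pull from Secrets Manager or an environment variable
    )
    try:
        with conn.cursor() as cur:
            cur.execute(UNLOAD_SQL)
        conn.commit()
    finally:
        conn.close()
```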
I believe you are looking for the AWS Data Pipeline service.
You can copy data from Redshift to S3 using the RedshiftCopyActivity (http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftcopyactivity.html).
I am copying the relevant content from the above URL for future reference:
"You can also copy from Amazon Redshift to Amazon S3 using RedshiftCopyActivity. For more information, see S3DataNode.
You can use SqlActivity to perform SQL queries on the data that you've loaded into Amazon Redshift."
Let me know if this helped.
You should try AWS Data Pipeline. You can schedule pipelines to run periodically or on demand. I am confident it would solve your use case.