Is there a way to publish custom metrics from AWS Glue jobs? - amazon-web-services

I'm using an AWS Glue job to move and transform data across S3 buckets, and I'd like to build custom accumulators to monitor the number of rows that I'm receiving and sending, along with other custom metrics. What is the best way to monitor these metrics? According to this document: https://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html I can keep track of general metrics on my glue job but there doesn't seem to be a good way to send custom metrics through cloudwatch.

I have done lots of similar project like this, each micro batch can be:
a file or a bunch of file
a time interval of data from API
a partition of records from database
etc ...
Your use case is can be break down into three question:
given a bunch of input, how could you define a task_id
how you want to define the metrics for your task, you need to define a simple dictionary structure for this metrics data
find a backend data store to store the metrics data
find a way to query the metrics data
In some business use case, you also need to store status information to track each of the input, are they succeeded? failed? in-progress? stuck? and you may want to control retry, and concurrency control (avoid multiple worker working on the same input)
DynamoDB is the perfect backend for this type of use case. It is a super fast, no ops, pay as you go, automatically scaling key-value store.
There's a Python library that implemented this pattern https://github.com/MacHu-GWU/pynamodb_mate-project/blob/master/examples/patterns/status-tracker.ipynb
Here's an example:
put your glue ETL job main logic in a function:
def glue_job() -> dict:
...
return your_metrics
given an input, calculate the task id identifier, then you just need
tracker = Tracker.new(task_id)
# start the job, it will succeed
with tracker.start_job():
# do some work
your_metrics = glue_job()
# save your metrics in dynamodb
tracker.set_data(your_metrics)

Consider enabling continuous logging on your AWS Glue Job. This will allow you to do custom logging via. CloudWatch. Custom logging can include information such as row count.
More specifically
Enable continuous logging for you Glue Job
Add logger = glueContext.get_logger() at the beginning of you Glue Job
Add logger.info("Custom logging message that will be sent to CloudWatch") where you want to log information to CloudWatch. For example if I have a data frame named df I could log the number of rows to CloudWatch by adding logger.info("Row count of df " + str(df.count()))
Your log messages will be located under the CloudWatch log groups /aws-glue/jobs/logs-v2 under the log stream named glue_run_id -driver.
You can also reference the "Logging Application-Specific Messages Using the Custom Script Logger" section of the AWS documentation Enabling Continuous Logging for AWS Glue Jobs for more information on application specific logging.

Related

Need recommendation to create an API by aggregating data from multiple source APIs

Before I start doing this I wanted to get advice from the community on the best and most efficient manner to go about doing it.
Here is what I want to do:
Ingest data from multiple API's which returns JSON
Store it in either S3 or DynamoDB
Modify the data to use my JSON structure
Pipe out the aggregate data as an API
The data will be updated twice a day, so I would pull in the data from the source APIs and put it through my pipeline twice a day.
So basically I want to create an API by aggregating data from multiple source APIs.
I've started playing with Lambda and created the following function using Python.
#https://stackoverflow.com/a/41765656
import requests
import json
def lambda_handler(event, context):
#https://www.nylas.com/blog/use-python-requests-module-rest-apis/ USEFUL!!!
#https://stackoverflow.com/a/65896274
response = requests.get("https://remoteok.com/api")
#print(response.json())
return {
'statusCode': 200,
'body': response.json()
}
#https://stackoverflow.com/questions/63733410/using-lambda-to-add-json-to-dynamodb DYNAMODB
This works and returns a JSON response.
Here are my questions:
Should I store the data on S3 or DynamoDB?
Which AWS service should I use to aggregate the data into my JSON structure?
Which service should I use to publish the aggregate data as an API, API Gateway?
However, before I go further I would like to know what is the best way to go about doing this.
If you have experience with this I would love to hear from you.
The answer will vary depending on the quantity of data you're planning to mine. Lambdas are designed for short-duration, high-frequency workloads and thus might not be suitable.
I would recommend looking into AWS Glue, as this seems like a fairly typical ETL (Extract Transform Load) problem. You can set up glue jobs to run on a schedule, and as for data aggregation, that's the T in ETL.
It's simple to output the glue dataframe (result of a transformation) as s3 files, which can then be queried directly by Amazon Athena (as if they were db content).
As for exposing that data via an API, the serverless framework or SST are great tools for taking the sting out of spinning up a serverless API and associated resources.

how to setup multiple automated workflows on aws glue

We're trying to use AWS Glue for ETL operations in our nodejs project. The workflow will be like below
user uploads csv file
data transformation from XYZ format to ABC format(mapping and changing field names)
download transformed csv file to local system
Note that, this flow should happen programmatically(creating crawlers, job triggers should be done programmatically not using the console). I don't know why documentation and other articles always show how to create crawlers, create jobs from glue console?
I believe that we have to create lambda functions and triggers. but not quite sure how to achieve this end to end flow. can anyone please help me. Thanks

ELK stack (Elasticsearch, Logstash, Kibana) - is logstash a necessary component?

We're currently processing daily mobile app log data with AWS lambda and posting it into redshift. The lambda structures the data but it is essentially raw. The next step is to do some actual processing of the log data into sessions etc, for reporting purposes. The final step is to have something do feature engineering, and then use the data for model training.
The steps are
Structure the raw data for storage
Sessionize the data for reporting
Feature engineering for modeling
For step 2, I am looking at using Quicksight and/or Kibana to create reporting dashboard. But the typical stack as I understand it is to do the log processing with logstash, then have it go to elasticsreach and finally to Kibana/Quicksight. Since we're already handling the initial log processing through lambda, is it possible to skip this step and pass it directly into elasticsearch? If so where does this happen - in the lambda function or from redshift after it has been stored in a table? Or can elasticsearch just read it from the same s3 where I'm posting the data for ingestion into a redshift table?
Elasticsearch uses JSON to perform all operations. For example, to add a document to an index, you use a PUT operation (copied from docs):
PUT twitter/_doc/1
{
"user" : "kimchy",
"post_date" : "2009-11-15T14:12:12",
"message" : "trying out Elasticsearch"
}
Logstash exists to collect log messages, transform them into JSON, and make these PUT requests. However, anything that produces correctly-formatted JSON and can perform an HTTP PUT will work. If you already invoke Lambdas to transform your S3 content, then you should be able to adapt them to write JSON to Elasticsearch. I'd use separate Lambdas for Redshift and Elasticsearch, simply to improve manageability.
Performance tip: you're probably processing lots of records at a time, in which case the bulk API will be more efficient than individual PUTs. However, there is a limit on the size of a request, so you'll need to batch your input.
Also: you don't say whether you're using an AWS Elasticsearch cluster or self-managed. If the former you'll also have to deal with authenticated requests, or use an IP-based access policy on the cluster. You don't say what language your Lambdas are written in, but if it's Python you can use the aws-requests-auth library to make authenticated requests.

Google Dataflow and Pubsub - can not achieve exactly-once delivery

I'm trying to achieve exactly-once delivery using Google Dataflow and PubSub using Apache Beam SDK 2.6.0.
Use case is quite simple:
'Generator' dataflow job sends 1M messages to PubSub topic.
GenerateSequence
.from(0)
.to(1000000)
.withRate(100000, Duration.standardSeconds(1L));
'Archive' dataflow job reads messages from PubSub subscription and saves to Google Cloud Storage.
pipeline
.apply("Read events",
PubsubIO.readMessagesWithAttributes()
// this is to achieve exactly-once delivery
.withIdAttribute(ATTRIBUTE_ID)
.fromSubscription('subscription')
.withTimestampAttribute(TIMESTAMP_ATTRIBUTE))
.apply("Window events",
Window.<Dto>into(FixedWindows.of(Duration.millis(options.getWindowDuration())))
.triggering(Repeatedly.forever(AfterWatermark.pastEndOfWindow()))
.withAllowedLateness(Duration.standardMinutes(15))
.discardingFiredPanes())
.apply("Events count metric", ParDo.of(new CountMessagesMetric()))
.apply("Write files to archive",
FileIO.<String, Dto>writeDynamic()
.by(Dto::getDataSource).withDestinationCoder(StringUtf8Coder.of())
.via(Contextful.of((msg, ctx) -> msg.getData(), Requirements.empty()), TextIO.sink())
.to(archiveDir)
.withTempDirectory(archiveDir)
.withNumShards(options.getNumShards())
.withNaming(dataSource ->
new SyslogWindowedDataSourceFilenaming(dataSource, archiveDir, filenamePrefix, filenameSuffix)
));
I added 'withIdAttribute' to both Pubsub.IO.Write ('Generator' job) and PubsubIO.Read ('Archive' job) and expect that it will guarantee exactly-once semantics.
I would like to test the 'negative' scenario:
'Generator' dataflow job sends 1M messages to PubSub topic.
'Archive' dataflow job starts to work, but I stop it in the middle of processing clicking 'Stop job' -> 'Drain'. Some portion of messages has been processed and saved to Cloud Storage, let's say 400K messages.
I start 'Archive' job again and do expect that it will take unprocessed messages (600K) and eventually I will see exactly 1M messages saved to Storage.
What I got in fact - all messages are delivered (at-least-once is achieved), but on top of that there are a lot of duplicates - something in the neighborhood of 30-50K per 1M messages.
Is there any solution to achieve exactly-once delivery?
Dataflow does not enable you to persist state across runs. If you use Java you can update a running pipeline in a way that does not cause it to lose the existing state, allowing you to deduplicate across pipeline releases.
If this doesn't work for you, you may want to archive messages in a way where they are keyed by ATTRIBUTE_ID, e.g,. Spanner or GCS using this as the file name.
So, I've never done it myself, but reasoning about your problem this is how I would approach it...
My solution is a bit convoluted, but I failed to identify others way to achieve this without involving other external services. So, here goes nothing.
You could have your pipeline reading both from pubsub and GCS and then combine them to de-duplicate the data. The tricky part here is that one would be a bounded pCollection (GCS) and the other an unbounded one (pubsub). You can add timestamps to the bounded collection and then window the data. During this stage you could potentially drop GCS data older than ~15 minutes (the duration of the window in your precedent implementation). These two steps (i.e. adding timestamps properly and dropping data that is probably old enough to not create duplicates) are by far the trickiest parts.
Once this has been solved append the two pCollections and then use a GroupByKey on an Id that is common for both sets of data. This will yield a PCollection<KV<Long, Iterable<YOUR_DATUM_TYPE>>. Then you can use an additional DoFn that drops all but the first element in the resulting Iterable and also removes the KV<> boxing. From there on you can simply continue processing the data as your normally would.
Finally, this additional work should be necessary only for the first pubsub window when restarting the pipeline. After that you should re-assign the GCS pCollection to an empty pCollection so the group by key doesn't do too much additional work.
Let me know what you think and if this could work. Also, if you decide to pursue this strategy, please post your mileage :).
In the meanwhile Pub/Sub has support for Exactly once delivery.
It is currently in the pre-GA launch state, so unfortunately not ready for production use yet.

Is there an api to send notifications based on job outputs?

I know there are api to configure the notification when a job is failed or finished.
But what if, say, I run a hive query that count the number of rows in a table. If the returned result is zero I want to send out emails to the concerned parties. How can I do that?
Thanks.
You may want to look at Airflow and Qubole's operator for airflow. We use airflow to orchestrate all jobs being run using Qubole and in some cases non Qubole environments. We DataDog API to report success / failures of each task (Qubole / Non Qubole). DataDog in this case can be replaced by Airflow's email operator. Airflow also has some chat operator (like Slack)
There is no direct api for triggering notification based on results of a query.
However there is a way to do this using Qubole:
-Create a work flow in qubole with following steps:
1. Your query (any query) that writes output to a particular location on s3.
2. A shell script - This script reads result from your s3 and fails the job based on any criteria. For instance in your case, fail the job if result returns 0 rows.
-Schedule this work flow using "Scheduler" API to notify on failure.
You can also use "Sendmail" shell command to send mail based on results in step 2 above.