My actual implementation includes Device Farm & EMR. Device Farm produces logs and saves them in S3, and I want EMR to pick them up and process them immediately (the ultimate goal is to put the produced structured info into DynamoDB).
What's the best approach? Is it possible to do that without integrating yet another component that polls S3 for new logs?
You can use event notifications on your S3 bucket. Create an event so that whenever a new object (log file) is created, it invokes a Lambda function or an SNS notification (which in turn invokes EMR). A sketch of the Lambda half of that pattern is below.
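For example, here is a minimal sketch of a Lambda handler that reacts to the S3 event and submits a processing step to an already-running EMR cluster via boto3. The cluster ID, bucket, and script path are placeholders, and the step definition assumes a Spark script; adapt it to your job.

```python
import boto3

emr = boto3.client("emr")

# Placeholder cluster ID and processing script; substitute your own.
CLUSTER_ID = "j-XXXXXXXXXXXXX"
SCRIPT = "s3://my-bucket/scripts/process_log.py"

def handler(event, context):
    # S3 event notifications can deliver several records per invocation.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Submit a step to the running cluster to process the new log file.
        emr.add_job_flow_steps(
            JobFlowId=CLUSTER_ID,
            Steps=[{
                "Name": f"process-{key}",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", SCRIPT, f"s3://{bucket}/{key}"],
                },
            }],
        )
```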
Tech stack: Salesforce data -> AWS AppFlow -> S3 -> Databricks job
Hello! I have an AppFlow flow that grabs Salesforce data and uploads it to S3 as a folder with multiple Parquet files. I have a Lambda that listens to the prefix where this folder is dropped. The Lambda then triggers a Databricks job, which is an ingestion process I have created.
My main issue is that when these files are uploaded to S3, my Lambda is triggered once per file. How can I have the Lambda run just once?
Amazon AppFlow publishes a flow notification (see Flow notifications in the Amazon AppFlow documentation) when a flow is complete:
Amazon AppFlow is integrated with Amazon CloudWatch Events to publish events related to the status of a flow. The following flow events are published to your default event bus.
AppFlow End Flow Run Report: This event is published when a flow run is complete.
You could trigger the Lambda function when this event is published. That way, it is only triggered when the flow is complete; a sketch of the matching EventBridge rule follows.
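As an illustration, this is one way to wire that up with boto3, assuming your ingestion Lambda already exists (its ARN below is a placeholder). The detail-type string comes from the event list above. Note that EventBridge also needs permission to invoke the Lambda (lambda add_permission), omitted here for brevity.

```python
import json
import boto3

events = boto3.client("events")

# Match the "AppFlow End Flow Run Report" event on the default event bus.
events.put_rule(
    Name="appflow-flow-complete",
    EventPattern=json.dumps({
        "source": ["aws.appflow"],
        "detail-type": ["AppFlow End Flow Run Report"],
    }),
)

# Point the rule at the ingestion Lambda (placeholder ARN).
events.put_targets(
    Rule="appflow-flow-complete",
    Targets=[{
        "Id": "ingest-lambda",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:ingest",
    }],
)
```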
I hope I've understood your issue correctly, but it sounds like your Lambda is working as configured: if it's set up to run every time a file is dropped into the S3 bucket, the S3 trigger will call the Lambda on every upload.
If you want to reduce how often your Lambda runs, set up an EventBridge trigger to check the bucket for new files. You could run this off an EventBridge cron rule that invokes the Lambda on a defined schedule, then send all the files to your Databricks job in bulk rather than individually (a sketch of that listing step is below).
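A rough sketch of that scheduled listing step, where the bucket, prefix, 15-minute window, and the trigger_databricks_job helper are all assumptions to replace with your own:

```python
import boto3
from datetime import datetime, timedelta, timezone

s3 = boto3.client("s3")

def handler(event, context):
    # Gather keys uploaded since the previous scheduled run (15-minute window here).
    cutoff = datetime.now(timezone.utc) - timedelta(minutes=15)
    new_keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="my-bucket", Prefix="appflow/salesforce/"):
        for obj in page.get("Contents", []):
            if obj["LastModified"] >= cutoff:
                new_keys.append(obj["Key"])
    if new_keys:
        trigger_databricks_job(new_keys)  # hypothetical helper wrapping your Databricks call
```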
How can I get notified when a Device Farm run is finished?
Is it possible to get the report into an S3 bucket? So it can be used as a source trigger in CodePipeline?
How can I get notified when a Device Farm run is finished?
One way to do that would be to have a small program that continuously calls get-run and checks the status (a simple polling loop is sketched below). There are no waiters in boto3 (assuming you're using it) for Device Farm at the time of writing this:
https://github.com/boto/botocore/tree/develop/botocore/data/devicefarm/2015-06-23
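A minimal sketch of such a polling loop, assuming boto3 and a run ARN you already have:

```python
import time
import boto3

# Device Farm's API is only available in us-west-2.
devicefarm = boto3.client("devicefarm", region_name="us-west-2")

def wait_for_run(run_arn, poll_seconds=30):
    # Poll get_run until the run reaches the terminal COMPLETED status.
    while True:
        run = devicefarm.get_run(arn=run_arn)["run"]
        if run["status"] == "COMPLETED":
            return run
        time.sleep(poll_seconds)
```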
Is it possible to get the report into an S3 bucket?
Device Farm's artifacts are already in S3; however, they're in the Device Farm account, not in the account the run was scheduled from. We can see they're already in S3 because the create-upload command returns an S3 presigned URL.
So it can be used as a source trigger in CodePipeline?
That would be cool, but it's not something the service does on our behalf at the moment. You would need to write the script to check if the run is finished, pull the artifacts, then re-upload the artifacts to another S3 bucket (a sketch follows the API list below).
Here are the links to the APIs needed in boto3:
get_run
list_artifacts
upload files example
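Putting those calls together, a rough sketch of the copy step, meant to run once the run is COMPLETED (see wait_for_run above); the destination bucket and key prefix are placeholders:

```python
import urllib.request
import boto3

devicefarm = boto3.client("devicefarm", region_name="us-west-2")
s3 = boto3.client("s3")

def copy_artifacts(run_arn, bucket):
    # list_artifacts returns a presigned URL for each artifact file.
    paginator = devicefarm.get_paginator("list_artifacts")
    for page in paginator.paginate(arn=run_arn, type="FILE"):
        for artifact in page["artifacts"]:
            name = f"{artifact['name']}.{artifact['extension']}"
            # Download via the presigned URL and re-upload to your own bucket.
            with urllib.request.urlopen(artifact["url"]) as resp:
                s3.upload_fileobj(resp, bucket, f"devicefarm/{name}")
```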
There are examples and documentation on copying data from Kafka topics to S3, but how do you copy data from S3 to Kafka?
When you read an S3 object, you get a byte stream, and you can send any byte array to Kafka with ByteArraySerializer. Or you can parse that InputStream into some custom object, then send it using whatever serializer you can configure (a Python sketch of the raw-bytes approach follows).
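For instance, a minimal sketch in Python, assuming the kafka-python client and placeholder bucket, key, broker, and topic names:

```python
import boto3
from kafka import KafkaProducer  # kafka-python client (pip install kafka-python)

s3 = boto3.client("s3")
# Sending raw bytes is the Python analogue of Kafka's ByteArraySerializer.
producer = KafkaProducer(bootstrap_servers="localhost:9092")

body = s3.get_object(Bucket="my-bucket", Key="data/file.bin")["Body"].read()
producer.send("my-topic", value=body)
producer.flush()
```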
You can find one example of a Kafka Connect process here (which I assume you are comparing to Confluent's S3 Connect writer): https://jobs.zalando.com/tech/blog/backing-up-kafka-zookeeper/index.html. It can be configured to read binary archives or line-delimited text from S3.
Similarly, Apache Spark, Flink, Beam, NiFi, and similar Hadoop-related tools can read from S3 and write events to Kafka as well.
The problem with this approach is that you need to keep track of which files have been read so far, as well as handle partially read files.
Depending on your scenario, or on the desired frequency for uploading the objects, you can use a Lambda function either on each event (e.g. every time a file is uploaded) or as a cron. The Lambda works as a producer by using the Kafka API and publishes to a topic.
Specifics:
The trigger for the Lambda function can be the s3:PutObject event, coming either directly from S3 or via CloudWatch Events.
You can run the Lambda as a cron if you don't need the objects instantly. The alternative in this case could be running a cron on an EC2 instance that has a Kafka producer and permissions to read objects from S3, and keeps pushing them to the Kafka topics. A minimal event-driven handler is sketched below.
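A sketch of the event-driven variant, again assuming kafka-python; the broker address and topic name are placeholders:

```python
import boto3
from kafka import KafkaProducer  # kafka-python client

s3 = boto3.client("s3")
producer = KafkaProducer(bootstrap_servers="broker1:9092")

def handler(event, context):
    # One s3:PutObject notification can carry multiple records.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        producer.send("s3-objects", value=body)
    producer.flush()
```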
I am new to AWS and don't know how to do the following. When I put an object in S3, I want to launch a Python script that does some transformations and writes the result to another path in S3. I've tried a Lambda function, but the process takes more than 300 seconds. I've also tried it with a Glue job, but I don't know how to trigger it when I put the file in S3.
Does anyone know how to do it? Maybe I'm using the wrong AWS tools.
The simple solution to your problem is here:
Since you've already mentioned that you have an AWS Glue job that does this operation, and all you don't know is how to trigger the Glue job when a file is placed in S3, I'll answer that question.
You can write an AWS Lambda using the boto3 module that is triggered by the S3 event and calls glue.start_job_run in your Lambda function:
import boto3

client = boto3.client('glue')
response = client.start_job_run(JobName='string')
https://boto3.readthedocs.io/en/latest/reference/services/glue.html#Glue.Client.start_job_run
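Fleshed out into a complete handler (a sketch only; the job name and argument key below are placeholders of our choosing, not something Glue defines):

```python
import boto3

glue = boto3.client("glue")

def handler(event, context):
    # Triggered by the S3 event; hand each new object to the Glue job.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        glue.start_job_run(
            JobName="my-transform-job",  # placeholder job name
            # Glue passes these through as job arguments to your script.
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
```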
Note: I strongly believe Glue is the right tool, rather than Lambda, for the requirement you mentioned in the question, because AWS Lambda has a timeout limitation: it times out after 300 seconds.
One option would be to use SQS:
Create the SQS queue.
Set up S3 to send notifications to the SQS queue when new objects are added to the source bucket. See Configuring Amazon S3 Event Notifications.
Set up your Python script on an EC2 instance and listen to the SQS queue in your code (a minimal polling loop is sketched after this list).
Upload the output of your Python script to the target S3 bucket after the script finishes.
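A rough sketch of that consumer loop, where the queue URL, target bucket, and transform function are all assumptions to replace with your own:

```python
import json
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder

def transform(data: bytes) -> bytes:
    return data  # hypothetical stand-in for your actual transformation

while True:
    # Long-poll the queue for S3 event notifications.
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        for record in json.loads(msg["Body"]).get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            data = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            s3.put_object(Bucket="target-bucket", Key=key, Body=transform(data))
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```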
Can you break up the Python processing into smaller steps? I'd definitely recommend using Lambda instead of managing EC2 if you can get your code to run within the Lambda restrictions.
I have a process that uses AWS EMR to run a PySpark cluster.
I have an S3 location where all the process logs get stored.
I want to understand whether there is a way I can filter out ERROR logs and get them mailed to my inbox. I do not want to save any log files on my system.
Is there any Python library that can help me monitor logs in real time? I have looked at boto3 and the EMR library, but I could not find an answer to my problem there.
The EMR logs will likely be buffered into chunks of a few minutes or some size before being written to S3 (full disclosure: that's based on experience with other AWS S3 logging systems, not EMR itself).
If I were attempting to solve this problem, I'd use an AWS Lambda function executing Python that reads the S3 log line by line and filters for the lines matching ERROR, then uses SNS to send those lines to your email address. You can use S3 events to automatically trigger the Lambda when objects are written to the S3 logging location for EMR, so this is as close to real time as you're going to get.
The architecture I am suggesting looks something like this:
EMR -> S3 -> Lambda -> SNS -> email inbox
The write of each EMR log to S3 triggers a Lambda, which uses boto3 to filter the log for error messages, sending alerts to an SNS topic for distribution to subscribers. A minimal version of that Lambda is sketched below.
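For concreteness, a minimal sketch of that Lambda, assuming a placeholder topic ARN and that the EMR log objects in S3 are gzip-compressed (which they typically are):

```python
import gzip
import boto3

s3 = boto3.client("s3")
sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:emr-errors"  # placeholder

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        # EMR log files in S3 usually end in .gz; fall back to plain text.
        if key.endswith(".gz"):
            text = gzip.decompress(raw).decode("utf-8", "replace")
        else:
            text = raw.decode("utf-8", "replace")
        errors = [line for line in text.splitlines() if "ERROR" in line]
        if errors:
            sns.publish(
                TopicArn=TOPIC_ARN,
                Subject=f"EMR errors in {key}"[:100],  # SNS subjects max out at 100 chars
                Message="\n".join(errors)[:262144],    # and messages at 256 KB
            )
```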
It may seem like a lot of moving parts, but it won't require much maintenance and should cost you only a few cents a month more than the S3 storage is already costing you. And the effort for the whole thing is actually pretty small.
Furthermore, you won't need:
a place to execute your code, servers to manage, etc.
a nontrivial deployment model for your project
any parts not shown above, for that matter
And you'll get for free:
Monitoring, in the form of
CloudWatch metrics for Lambda,
S3 logs (should you enable them),
CloudWatch Logs that store your function's execution windows and stdout.
Easy integration into alerting through CloudWatch Alarms (these typically integrate well with PagerDuty and the like)
Dead-simple extensibility, such as
SNS can send SMS messages to your phone
add more parsing options in the Lambda and redeploy
expose CloudWatch metrics and add alarms for thresholds
write the summary to S3 for presigned email or SMS links, or further processing now or later
You could send the email yourself through SES, or just manually with Python, but I would rather use SNS so that the subscriptions to the topic can vary independently of the Python code.
Lambdas are a little intimidating to start with, but they include the boto3 SDK by default (which should obviate the need for a zipfile of pip dependencies altogether), which simplifies setup.
For that matter, you can set all of this up in the AWS console if you like doing things by dragging the mouse around, or intend to do it only a few times; or you can express all of it in CloudFormation if you need something repeatable.
http://docs.aws.amazon.com/lambda/latest/dg/with-s3.html
http://docs.aws.amazon.com/lambda/latest/dg/python-programming-model-handler-types.html
http://docs.aws.amazon.com/sns/latest/dg/welcome.html