How do you run an AWS Lambda function on a regular basis that saves a screenshot of the webpage at a given URL?
Because your question is quite generic, I'm just going to point you to the resources I used to build a similar Lambda function for myself.
Getting puppeteer into Lambda
https://github.com/alixaxel/chrome-aws-lambda
Taking screenshots in Puppeteer
https://www.scrapehero.com/how-to-take-screenshots-of-a-web-page-using-puppeteer/
Then just run your job on a cron schedule, or queue it up however you like, and dump the screenshot into S3.
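As a rough sketch of how those pieces fit together (not the exact code I used), a Node.js handler could look like the following; the target URL and bucket name are placeholders you would replace:

const chromium = require('chrome-aws-lambda');
const AWS = require('aws-sdk');

const s3 = new AWS.S3();
const TARGET_URL = 'https://example.com';   // the page you want to capture
const BUCKET = 'my-screenshot-bucket';      // hypothetical bucket name

module.exports.handler = async () => {
  // chrome-aws-lambda ships a headless Chromium build that fits in a Lambda package
  const browser = await chromium.puppeteer.launch({
    args: chromium.args,
    executablePath: await chromium.executablePath,
    headless: chromium.headless,
  });

  try {
    const page = await browser.newPage();
    await page.goto(TARGET_URL, { waitUntil: 'networkidle2' });
    const png = await page.screenshot({ type: 'png', fullPage: true });

    // Dump the screenshot into S3, keyed by timestamp
    await s3.putObject({
      Bucket: BUCKET,
      Key: `screenshots/${Date.now()}.png`,
      Body: png,
      ContentType: 'image/png',
    }).promise();
  } finally {
    await browser.close();
  }
};

A CloudWatch Events / EventBridge schedule rule pointed at this function takes care of the "regular basis" part.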
Good luck!
I am trying to understand the correct way to set up my project on AWS so that I can ultimately have CI/CD on the Lambda functions, and also to ingrain good practices.
My application is quite simple: an API that calls Lambda functions based on users' requests.
I have deployed the application using AWS SAM. For that, I used a SAM template that used local paths to the Lambda functions' code and created the necessary AWS resources (API Gateway and Lambda). It was necessary to use local paths for the Lambda functions because the way SAM works does not allow using existing S3 buckets for S3 event triggers (see here), and I deploy a Lambda function that watches the S3 bucket for updated code in order to trigger Lambda updates.
Now what I have to do is push my Lambda code to GitHub, and have GitHub push the Lambda functions' code to the S3 bucket (and the correct prefix) created during the SAM deploy. What I would like is a way to do that automatically upon a GitHub push.
What is the preferred way to achieve that? I could not find clear information in the AWS documentation. Also, if you see a clear flaw in my process, don't hesitate to point it out.
What you're looking to do is a standard CI/CD pipeline.
The steps of your pipeline will be (more or less): Pull code from GitHub -> Build/Package -> Deploy
You want this pipeline to be triggered upon a push to GitHub; this can be done by setting up a webhook, which will then trigger the pipeline.
The last two steps are supported by SAM, which I think you have already used, so it will be a matter of triggering the same commands from the pipeline.
These capabilities are supported by most CI/CD tools. If you want to keep everything in AWS, you could use CodePipeline, which also supports GitHub integration. Nevertheless, Jenkins is perfectly fine and suitable for your use case as well.
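As an illustration only (the stack name and deployment bucket are placeholders, and your template may need different capabilities), the build stage of such a pipeline could run a buildspec.yml along these lines:

version: 0.2
phases:
  install:
    runtime-versions:
      python: 3.9
    commands:
      - pip install aws-sam-cli
  build:
    commands:
      - sam build
      - sam deploy --stack-name my-api-stack --s3-bucket my-deploy-bucket --capabilities CAPABILITY_IAM --no-confirm-changeset --no-fail-on-empty-changeset

CodePipeline's GitHub source action (or a webhook into Jenkins) then gives you the push-triggered behaviour you described.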
There are a lot of ways you can do it, so it would eventually depend on how you decide to do it and what tools you are comfortable with. If you want to use native AWS tools, then CodePipeline is what might be useful.
You can use CDK for that
https://aws.amazon.com/blogs/developer/cdk-pipelines-continuous-delivery-for-aws-cdk-applications/
If you are not familiar with CDK and would prefer CloudFormation, then this can get you started.
https://docs.aws.amazon.com/codepipeline/latest/userguide/tutorials-github-gitclone.html
I'm currently trying to process a number of images simultaneously using Custom Labels via Postman. I'm a business client with AWS and have been on hold for over 30 minutes to speak with an engineer, but because AWS customer support sucks I'm asking the community if they can help. Rather than analyzing images one at a time, is there a way to analyze them all at once? Any help would be great; I really need it at this time.
Nick
I don't think there is a direct API or SDK by AWS for asynchronous image processing with custom labels.
But a good workaround here is to introduce an event-based architecture yourself.
You can upload images in batch to S3 and configure S3 events to send the event notification to an SNS topic.
You can have your API subscribe to this SNS topic, which passes in the object name and bucket name. Then, within the API, you have the logic to use Custom Labels and store the results in a database like DynamoDB. This way, you can process images asynchronously.
Just make sure you have the right inference hours configured so you don't flood your systems and make them unavailable.
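As a rough sketch of the consumer Lambda subscribed to that SNS topic (the project version ARN and table name below are placeholders):

const AWS = require('aws-sdk');

const rekognition = new AWS.Rekognition();
const dynamo = new AWS.DynamoDB.DocumentClient();

const PROJECT_VERSION_ARN = 'arn:aws:rekognition:...'; // your trained Custom Labels model
const TABLE_NAME = 'image-label-results';              // hypothetical results table

exports.handler = async (event) => {
  for (const record of event.Records) {
    // SNS wraps the original S3 event notification in the message body
    const s3Event = JSON.parse(record.Sns.Message);

    for (const s3Record of s3Event.Records) {
      const bucket = s3Record.s3.bucket.name;
      const key = decodeURIComponent(s3Record.s3.object.key.replace(/\+/g, ' '));

      // Analyze the uploaded image with your Custom Labels model
      const result = await rekognition.detectCustomLabels({
        ProjectVersionArn: PROJECT_VERSION_ARN,
        Image: { S3Object: { Bucket: bucket, Name: key } },
      }).promise();

      // Persist the labels so you can query them later
      await dynamo.put({
        TableName: TABLE_NAME,
        Item: { imageKey: key, labels: result.CustomLabels },
      }).promise();
    }
  }
};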
Hope this process solves your problem
You can achieve this by using a batch processing solution published by AWS.
Please refer to this blog for the solution: https://aws.amazon.com/blogs/machine-learning/batch-image-processing-with-amazon-rekognition-custom-labels/
Also, the solution can be deployed from GitHub, where it is published as an AWS Sample: https://github.com/aws-samples/amazon-rekognition-custom-labels-batch-processing. If you are in a region for which the deployment button is not provided, please raise an issue.
Alternatively, you can deploy this solution using SAM. The solution is developed as an AWS Serverless Application Model (SAM) application, so it can be deployed with the SAM CLI using the following steps:
Install the SAM CLI: https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-sam-cli-install.html
Download the code repository to your local machine.
From within the folder, execute the following steps. The folder name is referenced as sam-app in the example below.
# Step 1 - Build your application
cd sam-app
sam build
# Step 2 - Deploy your application
sam deploy --guided
I'd like to perform the following tasks on a regular basis (e.g. every day at 6 AM) using AWS:
get a new set of data using an API. This dataset is updated on a daily basis.
run a Python script that processes the obtained dataset using several Python libraries like matplotlib, pandas, and plotly
automatically send the output of the script, which would be a single PDF file or an HTML dashboard, via email to a group of specified recipients
I know how to perform all of the above items locally - my goal is to automate this routine. I'm new to AWS and would appreciate some advice on how to perform these tasks in a straightforward way. Based on the reading I did so far, it looks like the serverless approach may be able to do the job and also reduce the complexity, but I'm not sure which functionalities exactly I should use.
For scheduling, you can use Amazon EventBridge.
You can schedule AWS Lambda or AWS Step Functions; both of these are serverless :).
You can have 3 Lambdas:
One to get the data and save it in S3/DynamoDB (if you want to persist the data).
A processor Lambda that builds the report and saves it to S3.
Another Lambda to send the email using AWS SES, which will read the report from S3 and send it (sketched below).
If you don't want to use Step Functions, you can start your Lambda from an S3 put event, or you can trigger one Lambda from another using the aws-sdk.
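As a rough sketch of that last email Lambda (the bucket, key, and addresses below are placeholders; for a PDF attachment you would need sendRawEmail with a MIME message instead of sendEmail):

const AWS = require('aws-sdk');

const s3 = new AWS.S3();
const ses = new AWS.SES();

const BUCKET = 'my-report-bucket';          // hypothetical bucket
const REPORT_KEY = 'reports/latest.html';   // hypothetical key written by the processor Lambda
const RECIPIENTS = ['team@example.com'];    // placeholder recipients

exports.handler = async () => {
  // Read the report produced by the processor Lambda
  const report = await s3.getObject({ Bucket: BUCKET, Key: REPORT_KEY }).promise();

  // Send it as the HTML body of the email
  await ses.sendEmail({
    Source: 'reports@example.com',          // must be a verified SES identity
    Destination: { ToAddresses: RECIPIENTS },
    Message: {
      Subject: { Data: 'Daily report' },
      Body: { Html: { Data: report.Body.toString('utf-8') } },
    },
  }).promise();
};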
So there are different approaches you can take.
First off, I would create a Lambda. You can schedule the function to run on a cron job.
If the message you want to send is small:
I would create an SNS topic with an email fan-out.
Inside your Lambda you can then transform the data and send it out via SNS (a sketch follows below).
Otherwise:
I would use SES and send a mail via the SES SDK.
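For the SNS option, the publishing side is just a few lines; the topic ARN is a placeholder, and every email address subscribed to the topic receives the message:

const AWS = require('aws-sdk');

const sns = new AWS.SNS();
const TOPIC_ARN = 'arn:aws:sns:us-east-1:123456789012:daily-report'; // hypothetical topic

exports.handler = async () => {
  // ... transform your data into a short plain-text summary here ...
  const summary = 'Daily job finished: 42 records processed.'; // placeholder content

  await sns.publish({
    TopicArn: TOPIC_ARN,
    Subject: 'Daily report',
    Message: summary,
  }).promise();
};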
Some images are already uploaded to an AWS S3 bucket, and of course there are a lot of images. I want to edit and replace those images, and I want to do it on the AWS side; here I want to use AWS Lambda.
I can already do the job from my local PC, but it takes a very long time, so I want to do it on a server.
Is it possible?
Unfortunately, directly editing a file in S3 is not supported; check out this thread. To work around this, you need to download the file to your server/local machine, edit it, and re-upload it to the S3 bucket. You can also enable versioning.
For Node.js you can use Jimp
For Java: ImageIO
For Python: Pillow
Or you can use any technology to edit it and later upload it using the aws-sdk (a sketch with Jimp follows below).
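For example, a rough sketch of a Lambda that downloads an image from S3, edits it with Jimp, and writes it back; the bucket name is a placeholder, and the event is assumed to carry the object key:

const AWS = require('aws-sdk');
const Jimp = require('jimp');

const s3 = new AWS.S3();
const BUCKET = 'my-image-bucket'; // hypothetical bucket

exports.handler = async (event) => {
  const key = event.key; // assumed payload shape: { "key": "photos/img001.jpg" }

  // Download the original image
  const original = await s3.getObject({ Bucket: BUCKET, Key: key }).promise();

  // Edit it: resize to 800px wide as an example, keeping the aspect ratio
  const image = await Jimp.read(original.Body);
  image.resize(800, Jimp.AUTO).quality(85);
  const edited = await image.getBufferAsync(Jimp.MIME_JPEG);

  // Replace the object in S3 (enable versioning if you want to keep the originals)
  await s3.putObject({
    Bucket: BUCKET,
    Key: key,
    Body: edited,
    ContentType: Jimp.MIME_JPEG,
  }).promise();
};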
To build and deploy the Lambda function you can use the Serverless Framework: https://serverless.com/
I made YouTube videos on this a while back; this one covers how to get started with aws-lambda and serverless:
https://www.youtube.com/watch?v=uXZCNnzSMkI
You can trigger a Lambda using the AWS SDK.
Write a Lambda to process a single image and deploy it.
Then locally use the AWS SDK to list the images in the bucket and invoke the Lambda (asynchronously) for each file using invoke. I would also save somewhere which files have been processed so you can continue if something fails.
Note that the default limit for Lambda is 1000 concurrent executions, so to avoid reaching the limit you can send messages to an SQS queue (which then triggers the Lambda) or just retry when invoke throws an error.
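A rough sketch of that local driver script (bucket and function names are placeholders):

const AWS = require('aws-sdk');

const s3 = new AWS.S3();
const lambda = new AWS.Lambda();

const BUCKET = 'my-image-bucket';    // hypothetical bucket
const FUNCTION_NAME = 'edit-image';  // hypothetical Lambda name

async function run() {
  let continuationToken;
  do {
    // List a page of objects (listObjectsV2 returns up to 1000 keys per call)
    const page = await s3.listObjectsV2({
      Bucket: BUCKET,
      ContinuationToken: continuationToken,
    }).promise();

    for (const object of page.Contents) {
      // InvocationType 'Event' = asynchronous invocation
      await lambda.invoke({
        FunctionName: FUNCTION_NAME,
        InvocationType: 'Event',
        Payload: JSON.stringify({ key: object.Key }),
      }).promise();
      console.log('queued', object.Key); // in practice, record processed keys somewhere durable
    }

    continuationToken = page.NextContinuationToken;
  } while (continuationToken);
}

run().catch(console.error);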
Given a REST API, outside of my AWS environment, which can be queried for json data:
https://someExternalApi.com/?date=20190814
How can I setup a serverless job in AWS to hit the external endpoint on a periodic basis and store the results in S3?
I know that I can instantiate an EC2 instance and just set up a cron job. But I am looking for a serverless solution, which seems to be more idiomatic.
Thank you in advance for your consideration and response.
Yes, you absolutely can do this, and probably in several different ways!
The pieces I would use would be:
CloudWatch Event using a cron-like schedule, which then triggers...
A Lambda function (with the right IAM permissions) that calls the API using e.g. Python requests or an equivalent HTTP library, and then uses the AWS SDK to write the results (sketched below) to...
An S3 bucket ready to receive!
This should be all you need to achieve what you want.
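As a sketch of the Lambda piece in Node.js (the answer mentions Python requests; the same shape works with boto3), with the bucket name as a placeholder:

const https = require('https');
const AWS = require('aws-sdk');

const s3 = new AWS.S3();
const BUCKET = 'my-api-dumps'; // hypothetical bucket

// Minimal GET helper using Node's built-in https module
function get(url) {
  return new Promise((resolve, reject) => {
    https.get(url, (res) => {
      let body = '';
      res.on('data', (chunk) => (body += chunk));
      res.on('end', () => resolve(body));
    }).on('error', reject);
  });
}

exports.handler = async () => {
  const date = new Date().toISOString().slice(0, 10).replace(/-/g, ''); // e.g. 20190814
  const json = await get(`https://someExternalApi.com/?date=${date}`);

  await s3.putObject({
    Bucket: BUCKET,
    Key: `dumps/${date}.json`,
    Body: json,
    ContentType: 'application/json',
  }).promise();
};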
I'm going to skip the implementation details, as it is largely outside the scope of your question. As such, I'm going to assume your function already is written and targets nodeJS.
AWS can do this on its own, but to make it simpler, I'd recommend using Serverless. We're going to assume you're using this.
Assuming you're entirely new to serverless, the first thing you'll need to do is to create a handler:
serverless create --template "aws-nodejs" --path my-service
This creates a service based on the aws-nodejs template on the provided path. In there, you will find serverless.yml (the configuration for your function) and handler.js (the code itself).
Assuming your function is exported as crawlSomeExternalApi on the handler export (module.exports.crawlSomeExternalApi = () => {...}), the functions entry on your serverless file would look like this if you wanted to invoke it every 3 hours:
functions:
  crawl:
    handler: handler.crawlSomeExternalApi
    events:
      - schedule: rate(3 hours)
That's it! All you need now is to deploy it through serverless deploy -v
Under the hood, what this does is create a CloudWatch schedule entry on your function. An example of it can be found in the documentation.
The first thing you need is a Lambda function. Implement your logic, hitting the API and writing data to S3 or whatever, inside the Lambda function. Next, you need a schedule to periodically trigger your Lambda function. A schedule expression can be used to trigger an event periodically, using either a cron expression or a rate expression. The Lambda function you created earlier should be configured as the target for this CloudWatch rule.
The resulting flow will be: CloudWatch invokes the Lambda function whenever the rule fires, and the Lambda then performs your logic.
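For example, either of these schedule expressions would run the job once a day (CloudWatch/EventBridge cron fields are minutes, hours, day-of-month, month, day-of-week, year, and times are in UTC):

cron(0 6 * * ? *)   - every day at 06:00 UTC
rate(1 day)         - once a day, starting from when the rule is created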