Is amazon lambda suitable for web scraping? - amazon-web-services

If I create a function to get webpages. Will it execute it on different IP per execution so that my scraping requests dont get blocked?

I would use this AWS pipeline:
Where at source on the left you will have an EC2 instance with JAUNT which then feeds the URLS or HTML pages into a Kinesis Stream. The Lambda will do your HTML parsing and via Firehose stuff everything into S3 or Redshift.
The JAUNT can run via a standard WebProxy service with a rotating IP.

Yes, lambda by default executes with random IPs. You can trigger it using things like event bridge so you can have a schedule to execute the script every hour or similar. Others can possibly recommend using API Gateway, however, it is highly insecure to expose API endpoints available for anyone to trigger. So you have to write additional logic to protect it either by hard coded headers or say oauth.

AWS Lambda doesn't have a fixed IP source as mentioned here
however, I guess this will happen when it gets cooled down, not during the same invocation.

Your internal private IP address isn't what is seen by the web server, it's the public ip address from AWS. As long as you are setting the headers and signaling to the webserver that your application is a bot. See this link for some best practices on creating a web scrapper:
https://www.blog.datahut.co/post/web-scraping-best-practices-tips
Just be respectful of how much data you pull and how often is the main thing in my opinion.

Lambda is triggered when a file is placed in S3, or data is being added to Kinesis or DynamoDB. That is often backwards of what a web scraper needs, though certainly things like S3 could perform as a queue/job runner.
Scraping on different IPs? Certainly lambda is deployed on many machines, though that doesn't actually help you, since you can't control the machines or their IP.

Related

AWS consuming data from API REST

I'm writing to you because I'm quite a novice with AWS... I only worked before with EC2 instances for simple tasks...
I am currently looking for an AWS service for reciving data using REST API calls (to external AWS services).
So far I have used EC2 where I deployed my library (python) that made calls and stored data in S3.
What more efficient ways does AWS offer for this? some SaaS?
I know that they are still more details to know in order to choose a good services but I would like to know from where I can start looking.
Many thanks in advance :)
I make API requests using AWS Lambda. Specifically, I leave code that makes requests, writes the response to a file and pushes the response object (file) to AWS S3.
You'll need a relative/absolute path to push the files to wherever you want to ingest. By default lambda servers current working directory is: /var/task but you may want to write your files to /tmp/ instead.
You can automate the ingestion process by setting a CloudWatch rule to trigger your function. Sometimes I chain lambda functions when I need to loop requests with changing parameters instead of packing all requests within a single function,
i.e.
I leave the base request (parameterized) in one function and expose the function through an API Gateway endpoint.
I create a second function to call the base function once for each value I need by using the Event object (which is the JSON body of a regular request). This data will replace parameters within the base function.
I automate the second function.
Tip:
Lambda sometimes will run your requests inside the same server. So if you're continuously running these for testing the server may have files from past calls that you don't want, so I usually have a clean-up step at the beginning of my functions that iterates through my filesystem to make sure there are no files before making the requests.
Using python 3.8 as a runtime I use the requests module to send the request, I write the file and use boto3 to push the response object to an aws S3 bucket.
To invoke an external service you need some "compute resources" to run your client. Under compute resources in aws we understand ec2, ecs (docker container) or lambda (serverless - my favorite)
You had your code already running on EC2 so you should already know you need VPC with a public subnet and ip address to make an outbound call regardless the compute resource you choose

AWS tech stack solution for a static website

I have a project where I am building a simple single page app, that needs to pull data from an api only once a day. I have a backend that I am thinking of building with golang, where I need to do 2 things:
1) Have a scheduled job that would once a day update the DB with the new data.
2) Serve that data to the frontend. Since the data would only be updated once a day, I would like to cache it after each update.
Since, the number of options that AWS is offering is a bit overwhelming, I am wondering what would be the ideal solution for this scenario. Should I use lambda that connects to DB and updates it with a scheduled job? Should I create then a separate REST API lambda where I would pull that data from the DB and call it from the frontend?
I would really appreciate suggestions for this problem.
Her is my suggestion;
Create a lambda function
it will fetch required information from database
You may use S3 or DynamoDB to save your content. Both of the solutions may be free please check for free tier offers depending on your usage
it will save the fetched content to S3 or DynamoDB (you may check Dax for DynamoDB caching)
Create an Api gateway and integrate it to your lambda (Elastic LoadBalancer is another choice)
Create a Schedule Expressions on CloudWatch to trigger lambda daily
Make a request from your front end to Api Gateway or ELB.
You may use Route 53 for domain naming.
Your lambda should have two separate functions, one is to respond schedule expression, the other one is to serve your content via communicating with S3/DynamoDB.
Edit:
Here is the architecture
Edit:
If the content is going to be static, you may configure a S3 bucket for static site serving and your daily lambda may write it in there when it is triggered. Then you no longer need api gateway and DynamoDB.
here is the documentation for s3 static content

How would I create a Minecraft EC2 server that automaticaly starts when someone tries to use it

Currently, I have a working modded Minecraft server working running on a C5 EC2 instance. The problem is that I have to manually start and stop the server which can be annoying for my friends. I was wondering if it would be possible to automate the EC2 state so that it runs as soon as a player attempts to join the sever. This would be similar to how Minecraft realms behaves which I heard Mojang was using AWS for:
https://aws.amazon.com/blogs/aws/hosting-minecraft-realms-on-aws/
I have looked up tutorials for this and this is the best I could come across:
https://github.com/trevor-laher/OnDemandMinecraft
The problem with this solution is that it requires to make a separate website to log users in and start the EC2 instance while I want the startup and shutdown to be completely automatic.
I would appreciate any guidance.
If the server is off, it would not be possible to "connect" to the server. Therefore, another mechanism is required that can be used to start the server.
Combine that with your desire to minimise costs and the only real solution is to somehow trigger an AWS Lambda function, which could start the server.
There are a few ways you could have users trigger the AWS Lambda function:
Make a call to API Gateway
Upload an object to Amazon S3
Somehow put a message in an SNS topic or an SQS queue
Trigger an Amazon CloudWatch Alarm (which calls Lambda via SNS)
...and probably other ways
When considering a method to use, you should consider security implications such as:
Whether only authorized users should be able to trigger the Lambda function, or is it okay that anybody (eg a web crawler) might trigger it.
Whether you willing to give your friends AWS credentials (not a good idea) that they could use to start the server directly, or whether it should be an indirect method.
Frankly, I would recommend the following architecture:
Create an AWS Lambda function that turns on the server
Create an API Gateway that triggers the Lambda function
Give a URL to your friends that calls the API Gateway and passes a 'secret' (effectively a password)
The API Gateway will call the Lambda function, passing the secret
The Lambda function confirms that the secret is correct and starts the Amazon EC2 instance with Minecraft installed
Here is a tutorial that shows many of these concepts: Build an API Gateway API with Lambda Integration
The purpose of the secret is to avoid the server from starting if an unauthorized person (or a bot) happens to hit the API Gateway endpoint. They will not provide the secret, so the server will not be started.
Stopping the server after a period of non-use is a different matter. The library you referenced might be able to assist with finding a way to do this. You could have a script running on the Minecraft server that monitors the game and, after a period of inactivity, simply calls the operating system to perform a Shutdown.
You could use a BungeeCord hub server that then allows user to begin a connection to the main server and spin it up via. AWS.
This would require the bungee server to be always up however, but the cost of hosting a small bungee server should be relatively cheap.
I don't think there's any way you could do this without having a small server that receives the initial request to spin up the AWS machine.

AWS Lambda - Work with services outside Amazon services?

After reading about AWS Lambda I've taken a quite interest in it. Although there is one thing that I couldn't really find any info on. So my question is, is it possible to have lambda work outside Amazon services? Say if I have a database from some other provider, would it be possible to perform operations on it through AWS Lambda?
Yes.
AWS Lambda functions run code just like any other application. So, you can write a Lambda function that calls any service on the Internet, just like any computer. Of course, you'd need access to the external service across the Internet and access permissions.
There are some limitations to Lambda functions, such as functions only running for a maximum of five minutes and only 500MB of local disk space.
So when should you use Lambda? Typically, it's when you wish some code to execute in response to some event, such as a file being uploaded to Amazon S3, data being received via Amazon Kinesis, or a skill being activated on an Alexa device. If this fits your use-case, go for it!

AWS WAF - Auto Save Web Application Firewall logs in S3

How do you route AWS Web Application Firewall (WAF) logs to an S3 bucket? Is this something I can quickly do through the AWS Console? Or, would I have to use a lambda function (invoked by a CloudWatch timer event) to query the WAF logs every n minutes?
UPDATE:
I'm interested in the ACL logs (Source IP, URI, Matches rule, Request Headers, Action, Time, etc).
UPDATE (05/15/2017)
AWS doesn't provide an easy way to view/parse these logs. You can get a "random sample" via the get-sampled-requests command. Which isn't acceptable...
Gets detailed information about a specified number of requests--a
sample--that AWS WAF randomly selects from among the first 5,000
requests that your AWS resource received during a time range that you
choose. You can specify a sample size of up to 500 requests, and you
can specify any time range in the previous three hours.
http://docs.aws.amazon.com/cli/latest/reference/waf/get-sampled-requests.html
Also, I'm not the only one experiencing this issue either:
https://forums.aws.amazon.com/thread.jspa?threadID=220202
I was looking for this functionality today and stumbled across the referenced thread. It was, coincidentally, updated today:
Hello,
Thanks for your input. I have submitted a feature request on your
behalf to export WAF events to S3 for long term analysis.
Best Regards, albertpataws
The lack of this feature strikes me as being almost as odd as the fact that I can't change timezones for graphs.