Scrapy and flask on ec2 - amazon-web-services

So basically im new into this, so bear with me, please.
I have 3 python spiders that uses: scrappy,scrappy-user-agent,pandas,MongoDB.
they scrape around 150-200 pages every 12 hours and store the data locally into MongoDB collections.
And I have a flask app that connects the API endpoints with the collections and returns the data as response.
Would it possible to deploy both to same ec2 instance, or would flask and response be slowed down for users while the scrapping is done in parallel in same machine?

It is possible to deploy them both in the same instance. However, you need to know how much memory and CPU your both applications use, and choose your instance type accordingly.
Given the low frequency of your web scraping, it is very possible that it does not take much memory and CPU, but it may be the case if you are doing some heavy processing of the scrapped data.
To know about the memory and CPU configurations of each instance type: https://aws.amazon.com/ec2/instance-types/

Related

Recommended way to run a web server on demand, with auto-shutdown (on AWS)

I am trying to find the best way to architect a low cost solution to provide an on-demand web server for a certain amount of time.
The context is as follows: I have some large amount of data sitting on S3. From time to time, users will want to consult that data. I've written a Flask app that can display the data in a nice way for them. Beign poorly written, it really only accepts a single user session at the time. Currently therefore they have to download the Flask app and run it on their own machine.
I would like to find a way for users to request a cloud-based web server that would run the Flask app (through a docker container for example) on-demand, and give them access to it quickly, without having to do much if anything on their own machine.
Every user wanting to view the data would have their own web server created on demand (to avoid multiple users sharing the same web server, which wouldn't work with my Flask app)
Critically, and in order to avoid cost, the web server would terminate itself automatically after some (configurable) idle time (possibly with the Flask app informing the user that it's about to shut down, so that they can "renew" the lease).
Initially I thought that maybe AWS Fargate would be good: it can run docker instances, is quite configurable in terms of CPU/disk it can get (my Flask app is resource-hungry), and at least on paper could be used in a way that there is zero cost when users are not consulting the data (bar S3 costs naturally). But it's when it comes to the detail that I'm not sure...
How to ensure that every new user gets their own Fargate instance?
How to shut-down the instance automatically after idle time?
Is Fargate quick enough in terms of boot time?
The closest I can think is AWS App Runner. It's built on top of Fargate and it provides an intelligent scale out mechanism (probably you are not interested in this) as well as a scale to (almost) 0 capability. The way it works is that when the endpoint is solicited and it's doing work you pay for the entire fargate task (cpu/memory) you have selected in the configuration. If the endpoint is doing nothing you only pay for the memory (note the memory cost is roughly 20% of the entire cost so it's not scale to 0 but "quasi"). Checkout the pricing examples at the bottom of this page.
Please note you can further optimize costs by pausing/starting the endpoint (when it's paused you pay nothing) but in that case you need to create the logic that pauses/restarts it.
Another option you may want to explore is using Lambda this way (which would allow you to use the same container image and benefit from the intrinsic scale to 0 of Lambda). But given your comment "Lambda doesn’t have enough power, and the timeout is not flexible" may be a show stopper.

Migrating Django application from heroku (celery/redis) to aws fargate/lambda

Apologies in advance for my little knowledge of AWS
I'm trying to draw parallels between my current setup on Heroku to a move to AWS. I've run into some memory issues on Heroku because of some machine learning models I'm running and Heroku seems too expensive for my needs.
I was recommenced to move to aws using fargate which would be a better fit for my app. Below is my whole architecture, I'm hoping for some guidance on my direction of what I have and where I plan to go.
A django application running on heroku.
The base of functionality is the user uploads a video from their mobile device and uploads it to s3. A message from SNS is sent to my Heroku server that the upload is completed. The server kicks off a celery task that downloads the video from s3 and uses a machine learning model to do some natural language processing, then saves the results to my postresql database. Obviously this is very compute intensive, so I've run into some memory issues and can for-see scaling issues to come.
After lots of tweaking and attempts to no avail, I've decided to move over to AWS and leverage some of the cost benefits that I've seen in comparison to heroku of running more memory intensive tasks.
I should also mention there is a web interface involved with this django project and it isn't just a REST Api.
As far as AWS goes, I'm looking for a bit of direction. Possibly just a rough outline of the architecture I should look deeper into.
My first plan is to dockerize my application and go from there...but I'm a bit stuck on how my application fits (website, rest api, worker threads) into the AWS ecosystem.
AWS is a great fit for the application you describe. AWS Fargate / RDS will host your Django application. You have the option of using AWS Batch to handle your processing. One huge advantage is the ability to scale according to the needs of your application.
This image is one possible way to structure your application. It's a lot of work to get to this point, but AWS offers a lot of power and flexibility for reasonable costs IMO.

Redis -- how does it improve performance?

i'm relatively new to the world of web-development and have only recently learned memory hierarchies in computer systems. I recently came across Redis and am itching to try it out in a small web-app. But before I do, I was wondering how is Redis going to improve performance? From what i've read so far, it seems that Redis is an "in-memory" data store, so does that mean that whenever a user requests a data from the server, instead of fetching from the database (given that the Redis data store is already populated with the needed data) the request can be fulfilled by accessing the data directly from the server's memory? To be specific, say if i have a web-app which back-end server is hosted on AWS, and the database is stored on MLAB, then whenever a user requests a data, instead of querying to the server which redirects the request to MLAB, it can now directly fetch the data from the server without going to MLAB ? Also, by in-memory, does that mean that the data is stored in the RAM on my AWS server?
Finally, how is this different from a cache?
Thank you so much!!
Well, Redis is used as a cache, the difference with most of the traditional cache is that you have other nice structures like hashes, sets, lists, TTL on keys, hyperlologs and so on, not only pair key:value.
You are right what you define about Redis, is but take into account that if you want to move your data from MLAB database to Redis you have to design some process to keep Redis update in each update that happens in your database. So every query from your application will use Redis to get data but apart from that you will need a process to keep update Redis with changes on your database, so if you use your application to update the database (and there are no other external parts which update your DB), every time you get an update from your web-app you have to update the DB and also Redis or having a command/script which detect every time an updated happened in the DB and update Redis properly.
AWS also provides Redis services, like ElasticCache https://aws.amazon.com/elasticache/?nc1=h_ls so basically the AWS ECS instance where you have your application doesn't use the RAM but this ElasticCache service which can live on another physical machine.
Finally, Redis store on memory the data though, it uses a dump file to save partial data in case of crashes and it also offers a persistence mode

AWS celery and database

I am writing a webapp thats runs on AWS. My app requires users to upload their pdf files. I will convert them into Images using the "convert" utility in linux.
Here is my setup on Ubuntu 12.04:
Django
Celery
Django Celery
Boto
I am using apache as my webserver.
The work flow is as follows:
Three are three asynchronous tasks and two queues for handling all the processing and S3 for storing input and Output files.
A user uploads a pdf then:
accept_file_task is called: This task takes the user uploaded pdf and stores it in my S3 storage and then inserts a message into the input_queue(SQS)
check_queue_and_launch_instance_task: A periodic task that keeps monitoring the number of messages in the input_queue and launches instances whenever the queue has more messages than the no of Ec2 instances
The instances have a bootstrap script which is a while True: loop. Any of the instances can pick the message from the input_queue and do a Subprocess.Popen("convert "+input+ouput) and write the processed stated to output_queue and also upload the image generated into S3 output bucket and make it available as a download link
output_process_task: another periodic task that keeps polling the output_queue and whenever a message is available it will update the status in the table mentioned below.
I am using a model called Document to store all the status information. I also have users registering and hence a table to store all user information. Also Celery created a lot of tables to store all its task information. Right now I am using a single instance and the sqlite3 database (that comes with python) on that instance.
I am unsure about the following things
How do I scale up the database? Should I go for a RDS or a simpleDB or AmazonDB. If not celery, I could have easily used simpleDB. I am really stuck on this one
How do I get rid of the two periodic tasks check_queue_and_launch_instance_task and output_process_task. My idea is that Autoscaling must be used in some way so that if need at a later stage an Elastic Load Balancer can be used.
If any of you have designed something similar please help me on how to go about it
How do I scale up the database? Should I go for a RDS or a simpleDB or AmazonDB. If not celery, I could have easily used simpleDB. I am really stuck on this one
Keep in mind that premature optimization is the root of all evil. The question of RDS (which is really just MySQL, Oracle, or MS SQL) vs. SimpleDB is more of an application design decision than one based on scalability. SimpleDB is just a simple key-value store. RDS, on the other hand, will give you full ACID functionality. If your data is relational, then you should be using a relational database. If you just need a place to store simple strings or integers, then something like SimpleDB would make more sense.
Right now I am using a single instance and the sqlite3 database (that comes with python) on that instance.
Make sure that you understand the consequences of a) creating a single point-of-failure in your design and b) SQLite's limitations compared to using a standalone RDBMS in this application. (You can use it, but it's really intended for single-user applications).
How do I get rid of the two periodic tasks check_queue_and_launch_instance_task and output_process_task. My idea is that Autoscaling must be used in some way so that if need at a later stage an Elastic Load Balancer can be used.
If you're willing to replace Celery with SQS, you can tie together SQS + SNS + Cloudwatch to simplify this portion of your app. Though what you're doing doesn't sound like a bad choice, especially if it's working well already. Your time is probably better spent working on the problems in front of you rather than those that might occur down the road.

How to convert a WAMP stacked app running on a VPS to a scalable AWS app?

I have a web app running on php, mysql, apache on a virtual windows server. I want to redesign it so it is scalable (for fun so I can learn new things) on AWS.
I can see how to setup an EC2 and dump it all in there but I want to make it scalable and take advantage of all the cool features on AWS.
I've tried googling but just can't find a simple guide (note - I have no command line experience of Linux)
Can anyone direct me to detailed resources that can lead me through the steps and teach me? Or alternatively, summarise the steps in an answer so I can research based on what you say.
Thanks
AWS is growing and changing all the time, so there aren't a lot of books to help. Amazon offers training that's excellent. I took their three day class on Architecting with AWS that seems to be just what you're looking for.
Of course, not everyone can afford to spend the travel time and money to attend a class. The AWS re:Invent conference in November 2012 had a lot of sessions related to what you want, and most (maybe all) of the sessions have videos available online for free. Building Web Scale Applications With AWS is probably relevant (slides and video available), as is Dissecting an Internet-Scale Application (slides and video available).
A great way to understand these options better is by fiddling with your existing application on AWS. It will be easy to just move it to an EC2 instance in AWS, then start taking more advantage of what's available. The first thing I'd do is get rid of the MySql server on your own machine and use one offered with RDS. Once that's stable, create one or more read replicas in RDS, and change your application to read from them for most operations, reading from the main (writable) database only when you need completely current results.
Does your application keep any data on the web server, other than in the database? If so, get rid of all local storage by moving that data off the EC2 instance. Some of it might go to the database, some (like big files) might be suitable for S3. DynamoDB is a good place for things like session data.
All of the above reduces the load on the web server to just your application code, which helps with scalability. And now that you keep no state on the web server, you can use ELB and Auto-scaling to automatically run multiple web servers (and even automatically launch more as needed) to handle greater load.
Does the application have any long running, intensive operations that you now perform on demand from a web request? Consider not performing the operation when asked, but instead queueing the request using SQS, and just telling the user you'll get to it. Now have long running processes (or cron jobs or scheduled tasks) check the queue regularly, run the requested operation, and email the result (using SES) back to the user. To really scale up, you can move those jobs off your web server to dedicated machines, and again use auto-scaling if needed.
Do you need bigger machines, or perhaps can live with smaller ones? CloudWatch metrics can show you how much IO, memory, and CPU are used over time. You can use provisioned IOPS with EC2 or RDS instances to improve performance (at a cost) as needed, and use difference size instances for more memory or CPU.
All this AWS setup and configuration can be done with the AWS web console, or command-line tools, or SDKs available in many languages (Python's boto library is great). After learning the basics, look into CloudFormation to automate it better (I've written a couple of posts about that so far).
That's a bit of the 10,000 foot high view of one approach. You'll need to discover the details of each AWS service when you try to use them. AWS has good documentation about all of them.
Depending on how you look at it, this is more of a comment than it is an answer, but it was too long to write as a comment.
What you're asking for really can't be answered on SO--it's a huge, complex question. You're basically asking is "How to I design a highly-scalable, durable application that can be deployed on a cloud-based platform?" The answer depends largely on:
The specifics of your application--what does it do and how does it work?
Your tolerance for downtime balanced against your budget
Your present development and deployment workflow
The resources/skill sets you have on-staff to support the application
What your launch time frame looks like.
I run a software consulting company that specializes in consulting on Amazon Web Services architecture. About 80% of our business is investigating and answering these questions for our clients. It's a multi-week long project each time.
However, to get you pointed in the right direction, I'd recommend that you look at Elastic Beanstalk. It's a PaaS-like service that abstracts away the underlying AWS resources, making AWS easier to use for developers who don't have a lot of sysadmin experience. Think of it as "training wheels" for designing an autoscaling application on AWS.