Is there a way to compute this amount of data and still serve a responsive website? - django

Currently I am developing a Django + React website that will (I hope) serve a decent number of users. The project demo is mostly complete, and I am starting to think about the scale required to put this thing into production.
The website essentially does three things:
Grab data from external APIs (e.g. Twitter) for 50,000 unique keywords (the keywords don't change). This process happens every 30 minutes.
Run computation on all of the data, and save the results to the database. Assume that the algorithm is as optimized as possible.
When a user visits the website, it should serve a pretty graph/chart of the computed data for each keyword.
The issue is that this is far too intensive a task to be done by the same application that serves the website; users would be waiting decades to see their data. My current plan is to build a separate API that supplies the website with the data, which the website can then store in its own database. This separate API would process the data without fear of affecting users, and it should be able to finish each round of computation in under 30 minutes, in time for the next round of data.
Can anyone help me understand how I can better equip my project to handle the scale? I'd love some ideas.
As a 4th-year CS student, I figured it's time to put a real project out into the world, and I am very excited about it and the progress I've made so far. My main worry is that end users will be negatively affected if I don't figure out some kind of pipeline to make this process happen.
To reiterate my idea:
Django + React - This is the forward-facing website
External API - Grabs the data off the internet, processes it, and waits for a GET request from the website
Is there a better way to do this? Or, on the other hand, am I severely overestimating how computationally heavy this is?
Edit: Including current research
Handling computationally intensive tasks in a Django webapp
Separation of business logic and data access in django

What you want is to have the computation task executed by a different process in the "background".
The most straightforward and popular solution is to use Celery, see here.
The Celery worker(s) - which perform the background task - can either run on the same machine as the web application, or (when scale becomes an issue) you can change the configuration so that they run on an entirely different machine.
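As a minimal sketch of what that could look like (the broker URL, task names, and the get_keywords/fetch_keyword_data/compute_metrics/save_result helpers are placeholders for your own code, not anything Celery provides), a periodic pipeline driven by Celery beat might be:

    # tasks.py - a rough sketch, assuming Redis as the broker.
    from celery import Celery

    app = Celery("myproject", broker="redis://localhost:6379/0")

    # Kick off the whole pipeline every 30 minutes via Celery beat.
    app.conf.beat_schedule = {
        "refresh-keyword-data": {
            "task": "tasks.refresh_all_keywords",
            "schedule": 30 * 60,  # seconds
        },
    }

    @app.task
    def refresh_all_keywords():
        # Fan out one task per keyword so the work can be spread across workers.
        for keyword in get_keywords():       # placeholder: load your 50,000 keywords
            process_keyword.delay(keyword)

    @app.task
    def process_keyword(keyword):
        raw = fetch_keyword_data(keyword)    # placeholder: the external API call
        result = compute_metrics(raw)        # placeholder: your computation
        save_result(keyword, result)         # placeholder: write to the database

The Django views then only read the precomputed results, so page loads stay fast no matter how long the background computation takes.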

Related

How many seconds may an API call take before I need to worry?

I have to ask a more or less non-typical SO question and hope you don't mind. Right now I am developing my very first web application. I set up an AJAX function that requests some data from a third-party API and populates my HTML containers with the returned data.
Right now I query a single object and populate 3 HTML containers with around 15 lines of JavaScript code. When I activate the process/function by clicking a button on my frontend, it takes around 6-7 seconds until the HTML content is updated.
Is this a reasonable time? The user experience will honestly be more than bad, considering that I will have to query and manipulate far more data (I'm building a one-page dashboard related to soccer data).
There might be very controversial answers to that question, but what would be a fair enough time for the process to run using standard infrastructure? 1-2 seconds? (I will deploy the app on Heroku or DigitalOcean and will implement a proper caching setup to handle "regular visitors".)
Right now:
I use a virtualenv and the Django development server,
and a demo server from the third-party API, which might be slowed down for whatever reason (how do I check this?),
which might affect the current time needed (there will be many more variables, obviously).
Looking forward to your answers.
I personally think (and probably a lot of people would agree) that 6-7 seconds is a significant delay for rendering a small page. The cause of this issue might not come from Django directly. Check the following:
I use a virtualenv and the Django development server
You may be running the Django dev server; a production server might make things a bit faster (use django-debug-toolbar to find out what is causing the delay).
Add database indexes in your models.
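For example (a minimal sketch; the model and field names are made up, the point is just where db_index goes):

    from django.db import models

    class Match(models.Model):
        # Index the columns you filter or order by most often.
        kickoff = models.DateTimeField(db_index=True)
        home_team = models.CharField(max_length=100, db_index=True)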
a demo server from the third-party API which might be slowed down for whatever reason
Use the Chrome developer tools' 'Network' tab to watch how long that third-party call takes. It might not be visible there if you call the API from your view.py; in that case, add some timing code to calculate how long it takes to return.
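Something like this (a sketch, assuming the call is made with the requests library; the function name and URL parameter are placeholders):

    import time
    import requests

    def fetch_api_data(url):
        start = time.monotonic()
        response = requests.get(url, timeout=10)
        elapsed = time.monotonic() - start
        # Log how long the external call took, so you can separate it
        # from Django's own rendering time.
        print(f"Third-party API call took {elapsed:.2f}s")
        return response.json()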

Designing a distributed web scraper

The Problem
Lately I've been thinking about how to go about scraping the contents of a certain big, multi-national website, to get specific details about the products the company offers for sale. The website has no API, but there is some XML you can download for each product by sending a GET request with the product ID to a specific URL. So at least that's something.
The problem is that there are hundreds of millions of potential product IDs that could exist (between, say, 000000001 and 500000000), yet only a few hundred thousand products actually exist. And it's impossible to know which product IDs are valid.
Conveniently, sending a HEAD request to the product URL yields a different response depending on whether or not the product ID is valid (i.e. the product actually exists). And once we know that the product actually exists, we can download the full XML and scrape it for the bits of data needed.
Obviously sending hundreds of millions of HEAD requests will take an ungodly amount of time to finish if left to run on a single server, so I'd like to take the opportunity to learn how to develop some sort of distributed application (totally new territory for me). At this point, I should mention that this particular website can easily handle a massive amount of incoming requests per second without risk of DOS. I'd prefer not to name the website, but it easily gets millions of hits per day. This scraper will have a negligible impact on the performance of the website. However, I'll immediately put a stop to it if the company complains.
The Design
I have no idea if this is the right approach, but my current idea is to launch a single "coordination server", and some number of nodes to communicate with that server and perform the scraping, all running as EC2 instances.
Each node will launch some number of processes, and each process will be assigned a job by the coordination server containing a distinct range of potential product IDs to be scraped (e.g. product IDs 00001 to 10000). These jobs will be stored in a database table on the coordination server. Each job will contain info about:
Product ID start number
Product ID end number
Job status (idle, in progress, complete, expired)
Job expiry time
Time started
Time completed
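As a rough sketch, a job record might look something like this (the class and field names are just illustrative, not a commitment to any particular database):

    from dataclasses import dataclass
    from datetime import datetime
    from enum import Enum
    from typing import Optional

    class JobStatus(Enum):
        IDLE = "idle"
        IN_PROGRESS = "in_progress"
        COMPLETE = "complete"
        EXPIRED = "expired"

    @dataclass
    class ScrapeJob:
        start_id: int                          # product ID start number
        end_id: int                            # product ID end number
        status: JobStatus = JobStatus.IDLE
        expires_at: Optional[datetime] = None  # job expiry time
        started_at: Optional[datetime] = None
        completed_at: Optional[datetime] = None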
When a node is launched, a query will be sent to the coordination server asking for some configuration data, and for a job to work on. When a node completes a job, a query will be sent updating the status of the job just completed, and another query requesting a new job to work on. Each job has an expiry time, so if a process crashes, or if a node fails for any reason, another node can take over an expired job to try it again.
To maximise the performance of the system, I'll need to work out how many nodes should be launched at once, how many processes per node, the rate of HTTP requests sent, and which EC2 instance type will deliver the most value for money (I'm guessing high network performance, high CPU performance, and high disk I/O would be the key factors?).
At the moment, the plan is to code the scraper in Python, running on Ubuntu EC2 instances, possibly launched within Docker containers, and some sort of key-value store database to hold the jobs on the coordination server (MongoDB?). A relational DB should also work, since the jobs table should be fairly low I/O.
I'm curious to know from more experienced engineers if this is the right approach, or if I'm completely overlooking a much better method for accomplishing this task?
Much appreciated, thanks!
You are trying to design a distributed workflow system which is, in fact, a solved problem. Instead of reinventing the wheel, I suggest you look at AWS's SWF, which can easily do all state management for you, leaving you free to only worry about coding your business logic.
This is how a system designed using SWF would look (here I'll use SWF's standard terminology; you might have to go through the documentation to understand the terms exactly):
Start one workflow per productID.
The 1st activity will check whether the productID is valid, by making a HEAD request as you mentioned.
If it isn't, terminate the workflow. Otherwise, the 2nd activity will fetch the relevant XML content by making the necessary GET request and persist it, say, in S3.
The 3rd activity will fetch the S3 file, scrape the XML data and do whatever you need with it.
You can easily change the design above to have one workflow process a batch of product IDs.
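A rough sketch of the three activity bodies as plain Python functions (the product URL, bucket name, region, and the assumption that valid IDs return a 200 status are all mine; the actual SWF decider/worker wiring is left out):

    import requests
    import boto3

    PRODUCT_URL = "https://example.com/products/{product_id}.xml"  # placeholder URL
    s3 = boto3.client("s3", region_name="us-east-1")               # region is an assumption

    def check_product_exists(product_id):
        # Activity 1: the HEAD request tells us whether the product ID is valid
        # (assuming valid and invalid IDs differ by status code).
        response = requests.head(PRODUCT_URL.format(product_id=product_id), timeout=10)
        return response.status_code == 200

    def fetch_and_store_xml(product_id, bucket="scraper-raw-xml"):
        # Activity 2: download the product XML and persist it to S3.
        response = requests.get(PRODUCT_URL.format(product_id=product_id), timeout=30)
        response.raise_for_status()
        key = f"products/{product_id}.xml"
        s3.put_object(Bucket=bucket, Key=key, Body=response.content)
        return key

    def scrape_xml(bucket, key):
        # Activity 3: pull the stored XML back and extract whatever fields are needed.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        # ... parse the XML here and return the extracted product details ...
        return body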
Some other points that I'd suggest you keep in mind:
Understand the difference between crawling and scraping: crawling means fetching the relevant content from the website; scraping means extracting the necessary data from it.
Ensure that what you are doing is strictly legal!
Don't hit the website too hard, or they might blacklist your IP ranges. You have two options:
Add a delay between two crawls. This, too, can easily be achieved in SWF.
Use anonymous proxies.
Don't rely too much on XML results from some undocumented API, because that can change anytime.
You'll need high network performance EC2 instances. I don't think high CPU or memory performance would matter to you.

Pattern for sharing a large amount of data between the web application and a backend service in a Service Oriented Application

I have a web application which performs CRUD operations on a database. At times, I have to run a backend job to do a fair amount of number crunching/analytics on this data. This backend job will be written as a different service in a concurrent language, which will be independent of the main web application.
But actually sharing the DB between the two applications is probably not a best practice, as it will lead to tight coupling. What is the right pattern to use here? Since this data might amount to millions of DB rows, I'm not sure using a message queue / REST APIs would be the best way to go.
This is perhaps a very common scenario and many companies/devs have already solved this problem. Any pointers will be helpful.
From the question, it would seem that the background job does not modify the state of the database.
The simplest way to avoid a performance hit on the main application while the background job is running is to take a database dump and perform the analysis on that dump.
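For instance, if the data lives in PostgreSQL, the dump-and-restore step could be a small wrapper the analytics job runs before it starts (the database names and file path are placeholders, and pg_dump/pg_restore need to be installed and configured with credentials):

    import subprocess

    # Dump the live database to a file without touching the web app's queries.
    subprocess.run(
        ["pg_dump", "--format=custom", "--file=/tmp/app.dump", "app_production"],
        check=True,
    )

    # Restore the dump into a separate database that only the analytics job reads.
    # (The target database "analytics_copy" must already exist.)
    subprocess.run(
        ["pg_restore", "--clean", "--dbname=analytics_copy", "/tmp/app.dump"],
        check=True,
    )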

Designing backend software for multiplayer cross platform app

I am currently in the initial design phase of my first app.
In my app there will be individual sessions containing 1-5 users.
I need to be able to keep track of each user's GPS location and be able to push and pull those locations to each of the users. Each user will have the most recently reported location of every other user in the session.
There will be other calculations done on the data set, but those will be client-side; the server should only need to handle pushing and pulling of user locations (and usernames).
I'm predicting that, due to the nature of the app, 90% of sessions should not last more than 2 hours, with the possibility of the server ending sessions that are older than 24-48 hours (once real-world testing of the app begins, I will have a better idea of how long sessions should last).
I was thinking of using Django to build an API and storing all the data in the program itself rather than in a database, as this should be faster and I don't think it is necessary to persist the data since it has such a short lifetime.
Is this a good starting point? Is there anything I should be thinking about or considering? I'm completely new to designing backend software.
While performance might not even be an issue in the beginning, there are some things you can do once you hit a certain load:
Keep all your session data in one model, even if that means denormalizing your database a bit (putting redundant information into it). That way you only need one read from the database and no expensive JOINs (see the sketch after this list).
Use the Django caching framework (https://docs.djangoproject.com/en/dev/topics/cache/) to cache views, so multiple reads of the same data don't have to hit the database.
Before you start optimizing, profile your code to see where your performance bottlenecks really are. Sometimes you'll be surprised which operations are expensive, and which aren't.
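A minimal sketch of the first two points (the model, field names, and the 5-second cache timeout are all assumptions, not anything from the question):

    from django.db import models
    from django.http import JsonResponse
    from django.views.decorators.cache import cache_page

    class SessionMember(models.Model):
        # Denormalized on purpose: the session code, username, and location live on one
        # row, so a single query returns everything needed to answer a location pull.
        session_code = models.CharField(max_length=32, db_index=True)
        username = models.CharField(max_length=64)
        latitude = models.FloatField()
        longitude = models.FloatField()
        reported_at = models.DateTimeField(auto_now=True)

    @cache_page(5)  # cache the response briefly to absorb repeated polling
    def session_locations(request, session_code):
        members = SessionMember.objects.filter(session_code=session_code)
        data = [
            {"username": m.username, "lat": m.latitude, "lng": m.longitude}
            for m in members
        ]
        return JsonResponse({"members": data})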

Speeding up page loads while using external API's

I'm building a site with Django that lets users move content around between a bunch of photo services. As you can imagine, the application makes a lot of API calls.
For example: a user connects Picasa, Flickr, Photobucket, and Facebook to their account. Now we need to pull content from 4 different APIs to keep this user's data up to date.
Right now I have a function that updates each API, and I run them all simultaneously via threading (all the APIs that are not enabled return false on the second line, so no, it's not much overhead to run them all).
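Roughly, that fan-out looks like this (simplified; the update callables passed in stand in for my real per-service code):

    from concurrent.futures import ThreadPoolExecutor

    def refresh_all_services(user, updaters):
        # `updaters` is a list of callables such as update_flickr(user) -- placeholders
        # for the per-service update functions; disabled services return immediately.
        with ThreadPoolExecutor(max_workers=len(updaters)) as pool:
            return list(pool.map(lambda fn: fn(user), updaters))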
Here is my question:
What is the best strategy for keeping content up to date using these APIs?
I have two ideas that might work:
Update the APIs periodically (like a cron job), and whatever we have at the time is what the user gets.
benefits:
It's easy and simple to implement.
We'll always have pretty good data when a user loads their first page.
pitfalls:
We have to make API calls all the time for users who are not active, which wastes a lot of bandwidth.
It will probably make the API providers unhappy.
Trigger the updates when the user logs in (on a pageload)
benefits:
We save a bunch of bandwidth and run less risk of pissing off the API providers.
Doesn't require NEARLY the same amount of resources on our servers.
pitfalls:
We either have to do the update asynchronously (and won't have anything on first login), or...
the first page will take a very long time to load because we're getting all the API data (I've measured 26 seconds this way).
Edit: the design is very light; it has only two images, an external CSS file, and two external JavaScript files.
Also, the 26-second number comes from the Firebug network monitor running on a machine that was on the same LAN as the server.
Personally, I would opt for the second method you mention. The first time you log in, you can query each of the services asynchronously, showing the user some kind of activity/status bar while the processes are running. You can then populate the page as you get the results back from each of the services.
You can then cache the results of those calls per user so that you don't have to call the APIs each time.
That lightens the load on your servers, loads your page fast, and provides the user with some indication of activity (along with incremental updates to the page as their content loads). I think those add up to the best user experience you can provide.
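A small sketch of that per-user caching (the cache key format, the 15-minute timeout, and the fetch_all_services callable are assumptions, not anything from the question):

    from django.core.cache import cache

    CACHE_TIMEOUT = 15 * 60  # seconds; tune to how fresh the photo data needs to be

    def get_user_content(user, fetch_all_services):
        # `fetch_all_services` is a placeholder callable for the combined per-service
        # API calls; it only runs on a cache miss.
        key = f"external-content:{user.pk}"
        data = cache.get(key)
        if data is None:
            data = fetch_all_services(user)
            cache.set(key, data, CACHE_TIMEOUT)
        return data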