Django Naive Chat Implementation. Will it work?

I was asked to build a chat application in Django for a special portal where, at any given time, only 4 staff users can be talking to 4 selected users from a larger user base (around 500 users, not all logged in at the same time).
I implemented it using Django and AJAX with the following details:
I use a Chat model that stores all the chat messages sent to the server so far. Its fields are:
userto
userfrom
message
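A minimal sketch of what such a model can look like in Django is below; the created timestamp and the use of the built-in user model are assumptions for illustration, not necessarily what I have.

```python
# models.py -- minimal sketch of the Chat model described above.
# Assumes Django's built-in user model; the `created` field is an
# illustrative addition, used only for ordering in later snippets.
from django.conf import settings
from django.db import models


class Chat(models.Model):
    userto = models.ForeignKey(settings.AUTH_USER_MODEL,
                               related_name='received_messages',
                               on_delete=models.CASCADE)
    userfrom = models.ForeignKey(settings.AUTH_USER_MODEL,
                                 related_name='sent_messages',
                                 on_delete=models.CASCADE)
    message = models.TextField()
    created = models.DateTimeField(auto_now_add=True)
```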
Each user can be in one of two states:
waiting for a chat room to become empty so that he can be allotted to it, or
in a chat room, talking with one of the staff users.
When in a chat room, I use AJAX to query all messages in the chat table and show the ones that belong to that chat. A request is sent to the server every 0.5 seconds to update the message list.
Note: inside a chat room, only the messages between the two participants should be displayed.
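The per-chat filtering can be done in the query itself rather than in Python; here is a sketch of the polling view, assuming the model sketched above and a hypothetical staff_id URL parameter.

```python
# views.py -- sketch of the AJAX polling endpoint. It returns only the
# messages exchanged between the logged-in user and one staff member;
# `staff_id` and the JSON shape are illustrative assumptions.
from django.db.models import Q
from django.http import JsonResponse

from .models import Chat


def chat_messages(request, staff_id):
    qs = (Chat.objects
          .filter(Q(userfrom=request.user, userto_id=staff_id) |
                  Q(userfrom_id=staff_id, userto=request.user))
          .order_by('created')
          .values('userfrom_id', 'message', 'created'))
    return JsonResponse({'messages': list(qs)})
```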
When not in a chat room, the user is redirected to a normal (non real-time) page that displays all chat rooms that can be joined.
Here the user refreshes the page manually, and only then is a request sent to the server to update the list of chat rooms. Each staff member talks to only one person at a time.
So basically the following should be the maximum load handled by the server every second:
16 requests per second in total from the 8 users currently in chat rooms (each sends 1 request every half second) to update the chat, each of which goes through all the chat messages sent so far.
At most around 30-50 users will be visiting the portal at any given time, so they may send refresh requests roughly once every 4-5 seconds. Here the database query is very small, since it only has to go through 4 rows, i.e. the chat rooms.
That gives around 30 requests to the server per second, of which 16 are heavy requests that have to go through every chat message sent so far, and around 14 are light requests that only query 4 rows.
This will eventually be deployed on a server with 256 GB of RAM (upgradable to 1.5 TB) and 3 x 1.8 TB disks; it currently runs on a server with a 1900 GB hard disk and around 128 GB of RAM.
I have the following 3 questions:
Will my current implementation work properly under the given server constraints? If not, I request:
firstly, improvements to the current system itself (like reducing the time between requests to update the chat, or database improvements), or
secondly, some better and easily implementable method, as I don't have much time to change the whole thing.
I am implementing an archive-chat feature on the staff side, so that only non-archived messages are scanned by those heavy requests every second, which should reduce the number of messages to go through considerably.
If possible I would like to structure the database so that there is a separate table for each staff/user combination. I don't know how to do this, so if there is a way please point me in the right direction.
That way I would only have to fetch the messages between one staff member and one user and display them, rather than going through all messages every time, which I presume would reduce the load significantly.
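For what it's worth, a separate table per staff/user pair is rarely needed in Django; the more usual approach is a single message table narrowed by an indexed foreign key to a room, so each poll only touches that room's rows. The ChatRoom/Message names and the archived flag below are illustrative, not part of my current code.

```python
# models.py -- sketch of the alternative to one-table-per-pair: one message
# table, narrowed by a `room` foreign key so a poll never scans other chats.
# ChatRoom/Message names and the `archived` flag are illustrative.
from django.conf import settings
from django.db import models


class ChatRoom(models.Model):
    staff = models.ForeignKey(settings.AUTH_USER_MODEL,
                              related_name='staff_rooms',
                              on_delete=models.CASCADE)
    visitor = models.ForeignKey(settings.AUTH_USER_MODEL,
                                related_name='visitor_rooms',
                                on_delete=models.CASCADE)


class Message(models.Model):
    room = models.ForeignKey(ChatRoom, related_name='messages',
                             on_delete=models.CASCADE)
    sender = models.ForeignKey(settings.AUTH_USER_MODEL,
                               on_delete=models.CASCADE)
    text = models.TextField()
    created = models.DateTimeField(auto_now_add=True, db_index=True)
    archived = models.BooleanField(default=False)


# A poll then scans only one room's non-archived messages, for example:
# Message.objects.filter(room_id=room_id, archived=False).order_by('created')
```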
I have closely followed the chat implementation from the following links:
https://github.com/MiniGunnR/django-jquery-chat-application
https://www.youtube.com/watch?v=Z8Gjm858CWg&t=322s

Related

Optimization and loadbalancing of microservice based backend

I have a client with a pretty popular ticket-selling service, to the point that the microservice-based backend is struggling to keep up, and I need to come up with a solution to optimize and load-balance the system. The infrastructure works through a series of interconnected microservices.
When a user enters the sales channels (mobile or web app), the request is directed to an AWS API Gateway, which orchestrates the communication with the microservice in charge of obtaining the requested resources.
These resources are provided by a third-party API.
This third party has physical servers in each venue that synchronize information between the POS systems and the digital sales channels.
We have a Redis instance that caches the requests we make to the third-party API; we cache each endpoint with a TTL relative to how frequently its information is updated.
Here is some background info:
We get traffic mostly from 2 major countries
On a normal day, about 100 thousand users will use the service, with a 70%/30% traffic split between the two countries.
On important days, each country has different opening hours (country A starts sales at 10 am UTC, but country B starts at 5 pm UTC); on these days the traffic increases several times over.
We have a main middleware through which all requests made by clients are processed.
We have a Redis cache database that stores GETs with different TTLs for each endpoint.
We have a middleware that decides whether to serve the request from the cache or to call the third party's API, as the case may be.
And these are the complaints I have gotten that need to be dealt with:
When one country receives a high volume of requests, the country with less traffic is negatively affected: the clients do not respond, or respond only partially, because the computation layer's limit was exceeded, and so the users have a bad experience.
Every time the above happens, the computation layer must be scaled up manually from the infrastructure side.
Each request has a different response time: stadiums respond in roughly 40 seconds and movie theaters in about 3 seconds. These requests enter a queue and are answered in order of arrival.
The error handling is not clear. The errors are mixed up, and you can't tell which country the errors are coming from or how many there are.
The responses from the third-party API are not cached correctly in the cache layer, since errors are stored for the duration of the TTL.
I was thinking of a couple of things I could suggest:
Adding instrumentation of the requests by using AWS X-Ray
Adding a separate table for errors in the Redis cache layer (old data has to be better than no data for the end user)
Adding AWS Elastic Load Balancing for the main middleware
But I'm not sure how realistic it would be to implement these 3 things, and I'm also not sure whether they would even solve the problem; I don't really have experience optimizing this type of backend. I would appreciate any suggestions, recommendations, links, documentation, etc. I'm really desperate for a solution to this problem.
A few thoughts:
When one country receives a high volume of requests, the country with less traffic is negatively affected: the clients do not respond, or respond only partially, because the computation layer's limit was exceeded, and so the users have a bad experience.
A common approach in AWS is to regionalize the stack. Assuming you are using CDK/CloudFormation, creating a regionalized stack should be a straightforward task.
But it is a question whether this will solve the problem. Your system suffers from availability issues, and regionalization will only isolate that problem to individual regions, so we should be able to do better (see below).
Every time the above happens, the computation layer must be scaled up manually from the infrastructure side.
AWS has an option to automatically scale the computation layer up and down based on traffic patterns. This is a neat feature, provided you set limits to make sure you are not overcharged.
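As a rough sketch of what that can look like, assuming the computation layer runs as an ECS service (the cluster/service names, capacity bounds and target value are placeholders):

```python
# Sketch: register an ECS service for target-tracking auto scaling with boto3.
# Cluster/service names and capacity bounds are placeholders.
import boto3

autoscaling = boto3.client("application-autoscaling")

autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/ticketing-cluster/api-middleware",
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=20,  # hard ceiling so a traffic spike cannot overcharge you
)

autoscaling.put_scaling_policy(
    PolicyName="cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId="service/ticketing-cluster/api-middleware",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,  # keep average CPU around 60%
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
    },
)
```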
Each request has a different response time: stadiums respond in roughly 40 seconds and movie theaters in about 3 seconds. These requests enter a queue and are answered in order of arrival.
It seems that the large variance is because you have to contact the servers at the venues. I recommend decoupling that activity: calls to the venues should be done asynchronously. There are several ways you could do that; queues with consumer push/pull are the usual approaches (please comment if more details are needed, but this is a fairly standard problem and there is a lot of material on the internet).
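A minimal sketch of that decoupling, assuming SQS as the queue (the queue URL, message shape and the fetch/cache callables are placeholders):

```python
# Sketch: decouple slow venue calls with a queue. The API enqueues the work
# and returns immediately; a worker drains the queue, calls the venue, and
# writes the result to the cache. Names/URLs here are placeholders.
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/venue-requests"


def enqueue_venue_refresh(venue_id: str) -> None:
    """Called from the request path: cheap and fast."""
    sqs.send_message(QueueUrl=QUEUE_URL,
                     MessageBody=json.dumps({"venue_id": venue_id}))


def worker_loop(fetch_from_venue, write_to_cache) -> None:
    """Long-running consumer: does the slow 40-second calls off the request path."""
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   MaxNumberOfMessages=10,
                                   WaitTimeSeconds=20)  # long polling
        for msg in resp.get("Messages", []):
            venue_id = json.loads(msg["Body"])["venue_id"]
            write_to_cache(venue_id, fetch_from_venue(venue_id))
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])
```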
The error handling is not clear. The errors are mixed up, and you can't tell which country the errors are coming from or how many there are.
That's a code fix, assuming you send data to CloudWatch (do you?). You could attach the country as context to every request, via a logging filter or something similar, so that whenever an error is logged that context is logged with it. You probably need the venue ID even more than the country, as you can derive the country from the venue ID.
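A sketch of that context idea with Python's standard logging (the country/venue values, and how you resolve them per request, are placeholders):

```python
# Sketch: attach country / venue_id to every log record via a logging.Filter,
# so errors can be grouped per country (e.g. in CloudWatch). Field values
# are illustrative; resolving the venue from the request is up to you.
import logging


class RequestContextFilter(logging.Filter):
    def __init__(self, country: str, venue_id: str):
        super().__init__()
        self.country = country
        self.venue_id = venue_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.country = self.country
        record.venue_id = self.venue_id
        return True  # never drop the record, just enrich it


logger = logging.getLogger("api")
logger.propagate = False
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s country=%(country)s venue=%(venue_id)s %(message)s"))
logger.addHandler(handler)
logger.addFilter(RequestContextFilter(country="AR", venue_id="stadium-42"))

logger.error("third-party API timed out")  # now carries country + venue
```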
The responses from the third-party API are not cached correctly in the cache layer, since errors are stored for the duration of the TTL.
Don't store errors in the cache, and add a circuit breaker pattern.
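A sketch of both ideas together, caching only successes and opening a simple counter-based breaker after repeated failures (thresholds, key names and the redis-py usage are assumptions):

```python
# Sketch: only cache successful responses, and trip a simple circuit breaker
# after repeated failures so the third party gets a breather.
import redis

r = redis.Redis()
FAILURE_THRESHOLD = 5
OPEN_SECONDS = 30


def get_events(venue_id, fetch_from_third_party, ttl=60):
    cached = r.get(f"events:{venue_id}")
    if cached is not None:
        return cached

    if int(r.get(f"breaker:{venue_id}") or 0) >= FAILURE_THRESHOLD:
        raise RuntimeError("circuit open: third party marked unhealthy")

    try:
        body = fetch_from_third_party(venue_id)
    except Exception:
        # Count the failure, but never write the error into the cache.
        pipe = r.pipeline()
        pipe.incr(f"breaker:{venue_id}")
        pipe.expire(f"breaker:{venue_id}", OPEN_SECONDS)
        pipe.execute()
        raise

    r.delete(f"breaker:{venue_id}")           # success closes the breaker
    r.setex(f"events:{venue_id}", ttl, body)  # only successes get a TTL
    return body
```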

Running a service (script) on a site with a specified interval

I have a website. The user logs in, enters a site URL, then sets an interval in minutes (for example, 7 minutes), and then leaves the site. After 7 minutes a program, script or service (I do not know what it is called) should start, perform certain actions on the site the user specified, and then email the result. How can I build this service so that it keeps working even after the user has left and closed the browser? I can't figure out which direction to move in... I use AWS from Amazon.
UPD: let me describe it in more detail. There is a login form; the user enters a login/password, the data is checked against a database table called users, and a cookie is set with the user id (idUser). Then the user enters one or more sites, which are stored in a table named data_(idUser). The interval is stored in settings_(idUser), with a value in the range 1-60 minutes. Suppose he sets an interval of 7 minutes. Then the user closes the tab and the browser. At the specified interval (7 minutes) a script should start that takes the data from data_(idUser) (several site URLs are stored there), processes them, and emails the results of the site check. But the problem is also that there will be only one script, and I don't know how to access the database if I don't know idUser, since I can't get it from the cookie either... Maybe I should change the database structure altogether?
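For what it's worth, the usual direction is a single worker script run on a fixed schedule (cron, or a scheduled job on AWS) that walks the users table itself, so it never needs the cookie. A rough sketch, assuming a last_run column is added to each settings_(idUser) table and using hypothetical check/mail helpers:

```python
# Rough sketch of one scheduled worker (run every minute by cron or an AWS
# scheduled job). Table/column layout follows the question; the `last_run`
# column and the check/mail helpers are hypothetical additions.
import sqlite3            # stand-in for whatever database the site uses
import time
import urllib.request


def check_site(url):
    # Hypothetical check: record the HTTP status (or the error) for the URL.
    try:
        return url, urllib.request.urlopen(url, timeout=10).status
    except Exception as exc:
        return url, str(exc)


def send_mail(id_user, results):
    # Placeholder: wire this up to SES / smtplib in the real service.
    print(f"mail to user {id_user}: {results}")


def run_due_checks(conn):
    now = int(time.time())
    for (id_user,) in conn.execute("SELECT id FROM users"):
        interval_min, last_run = conn.execute(
            f"SELECT interval_min, last_run FROM settings_{id_user}").fetchone()
        if now - (last_run or 0) < interval_min * 60:
            continue  # this user's interval has not elapsed yet
        urls = [row[0] for row in conn.execute(f"SELECT url FROM data_{id_user}")]
        send_mail(id_user, [check_site(u) for u in urls])
        conn.execute(f"UPDATE settings_{id_user} SET last_run = ?", (now,))
        conn.commit()


if __name__ == "__main__":
    run_due_checks(sqlite3.connect("site.db"))
```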

For Facebook webhook updates, how long does it normally take for a page post to be relayed by a webhook to a callback URL?

I'm troubleshooting an issue with a Node application I've inherited serving as a webhook callback endpoint.
To debug, I'm posting messages to a page that has been subscribed to by the Facebook app associated with the endpoint, and following my Node app's log.
After several hours, I still see no update requests from Facebook for my page posts.
Comparing the timestamps on the posts with my app's logs for the last update requests it received (several days ago), it appears there was about an 8-hour lag between a post and the corresponding update request.
I've searched the documentation for help but could only find this:
Update notifications are aggregated and sent in a batch of up to 1000 updates.
If any update sent to your server fails, we will retry immediately, then try a few more times with decreasing frequency over the next 24 hours. Your server should handle deduplication in these cases. Updates unaccepted for 24 hours will be dropped.
This gives me the impression that updates are not instantaneous. But are several hour delays the norm?
Can anybody with more experience with Graph API webhooks provide a ballpark for normal lag?

Understanding social networks API limits

I can't figure out a very strange thing that I see in the Instagram and Twitter API limits.
It seems that your user base can't exceed a very LOW limit, and then your app will just be blocked, because the limits are per app.
Instagram :
Per app, you have 5000 requests per hour (authenticated or not).
see here : http://instagram.com/developer/limits/
Does that mean that if my app created in Instagram, which has a client ID, makes a call on behalf of a mobile user, that is counted as 1 call, so I can't have more than 5000 users per hour using my app with my client ID?
Twitter
From the API limit doc:
If user A launches application Z, and app Z makes 10 calls to user A’s
mention timeline in a 15 minute window, then app Z has 5 calls left to
make for that window
It can be found here: https://dev.twitter.com/rest/public/rate-limiting
Does that mean that if I create an app in Twitter and my mobile users request their timelines, I can only have 15 active users in 15 minutes?
I don't know if I'm missing something big here, or if the whole API is just worthless; you can't do anything big (or even medium-sized) with 15 users per 15 minutes, or even 5000 users per hour.
I think you are misinterpreting things...
Instagram states:
Authenticated calls: 5,000 / hour per token
Unauthenticated calls: 5,000 / hour per application
As you will normally HAVE to use authenticated calls to get user information, I think 5,000 per access token (i.e. per user) per hour should be more than enough.
Twitter states that:
Rate limiting in version 1.1 of the API is primarily considered on a per-user basis — or more accurately described, per access token in your control. If a method allows for 15 requests per rate limit window, then it allows you to make 15 requests per window per leveraged access token.
The rate window is 15 minutes. This doesn't mean that you can only make 15 requests in total per access token per 15 minutes, but rather, for example, 15 requests to GET account/settings per access token per 15 minutes. See
https://dev.twitter.com/rest/public/rate-limits
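You can see the per-token accounting in practice by reading the x-rate-limit-* headers Twitter returns on each response; a sketch, with the endpoint and the per-user OAuth handling left as assumptions:

```python
# Sketch: the 1.1 limits are tracked per access token, and each response
# reports the remaining allowance for *that* token in its headers.
# The endpoint and per-user OAuth setup are assumed, not shown.
import requests


def remaining_calls(resp: requests.Response):
    """Return (calls left in this 15-minute window, epoch second it resets)."""
    return (int(resp.headers["x-rate-limit-remaining"]),
            int(resp.headers["x-rate-limit-reset"]))


# resp = requests.get("https://api.twitter.com/1.1/account/settings.json",
#                     auth=per_user_oauth1_token)   # one token per user
# left, reset_at = remaining_calls(resp)
```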

Network AAA - concurrent login accounting

I am looking for a network AAA (authentication, authorization, accounting) protocol that can manage concurrent access to a network resource from one account. Say an account is logged in by two users concurrently: how can I distribute the account's session timeout between the two users?
I am assuming you are not looking for the specific AAA functionality as used by telecommunications companies, but rather for RADIUS on steroids. Perhaps the easiest way to do this is to use something like FreeRADIUS backed by a database.
I'll assume your particular NAS device (WiFi hub, packet gateway, etc.) supports the following RADIUS records:
Access Request
Access Accept/Reject
Accounting Start
Accounting Stop
Interim Accounting
Session Disconnect
When you get a session start, let FreeRADIUS run some sort of script, or log that start into a database. This is your clock start for each session. Even if the user logs in three times, you'll get a start message for each. When they log out of each session, you'll get a session stop. At a minimum, simply query the database, compute the deltas, and apply your accounting rules to that user. If that user used 10, 20 and 30 minutes in concurrent sessions, you'll get stop records showing 10, 20 and 30 minutes.
This works, but it doesn't go quite far enough. First, if the sessions are long, you won't know the duration of those sessions until they terminate, which could be days from now. This is where the accounting records, particularly the interim accounting records, come in. If your NAS supports it, you can tell it to generate an interim accounting record for a session, say, every 30 minutes. Thus, if a session lasts 30 minutes or less, you'll get just the start and stop records. If a session lasts 45 minutes, however, you'll get:
A start record at time 0
An interim accounting update at time 30
A stop record at time 45
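Turning those records into per-account usage is then just a matter of summing the deltas; here is a minimal sketch, where the record shape is an assumption about how you log them from FreeRADIUS:

```python
# Sketch: turn Start / Interim / Stop records into per-account usage, so
# concurrent sessions on one account can be summed and checked against the
# account's allowance. The record shape is an assumption.
from collections import defaultdict


def usage_per_account(records):
    """records: iterable of dicts with 'account', 'session_id', 'type'
    ('start' | 'interim' | 'stop') and 'timestamp' (seconds)."""
    started = {}                      # (account, session_id) -> last clock start
    usage = defaultdict(int)          # account -> seconds consumed so far
    for rec in sorted(records, key=lambda r: r["timestamp"]):
        key = (rec["account"], rec["session_id"])
        if rec["type"] == "start":
            started[key] = rec["timestamp"]
        elif rec["type"] in ("interim", "stop"):
            usage[rec["account"]] += rec["timestamp"] - started[key]
            started[key] = rec["timestamp"]   # next delta starts here
            if rec["type"] == "stop":
                started.pop(key, None)
    return dict(usage)


# Three concurrent sessions of 10, 20 and 30 minutes charge 60 minutes total:
print(usage_per_account([
    {"account": "alice", "session_id": s, "type": t, "timestamp": ts}
    for s, (start, stop) in enumerate([(0, 600), (0, 1200), (0, 1800)])
    for t, ts in (("start", start), ("stop", stop))
]))  # {'alice': 3600}
```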
It's not really the AAA server you care about; any RADIUS server will likely do the job (FreeRADIUS, OpenRADIUS, Microsoft's RADIUS server). It's your NAS device that matters: if it can't send the records, you can't process them.