I'm currently working on an API built with Django REST Framework, uWSGI, Nginx and Memcached.
I would like to know the best way to get user statistics, like the number of requests per user, taking into consideration that the infrastructure will probably scale to multiple servers.
And is there a way to determine whether a response was retrieved from the cache or from the application?
What I'm thinking is processing the Nginx logs to separate the requests by user and apply all the calculations.
First: you might find drf-tracking to be a useful project, but it stores the response to every request in the database, which we found kind of crazy.
The solution we developed instead is a mixin that borrows heavily from drf-tracking, but that only logs statistics. It uses our cache server (Redis), so it's very fast.
It's very simple to drop in if you're already using Redis:
from datetime import date

import redis
from django.conf import settings


class LoggingMixin(object):
    """Log requests to Redis.

    This draws inspiration from the code that can be found at:
    https://github.com/aschn/drf-tracking/blob/master/rest_framework_tracking/mixins.py

    The big distinctions, however, are that this code uses Redis for greater
    speed, and that it logs significantly less information.

    We want to know:
     - How many queries in the last X days, total?
     - How many queries ever, total?
     - How many queries total made by user X?
     - How many queries per day made by user X?
    """
    def initial(self, request, *args, **kwargs):
        super(LoggingMixin, self).initial(request, *args, **kwargs)

        d = date.today().isoformat()
        user = request.user
        endpoint = request.resolver_match.url_name

        r = redis.StrictRedis(
            host=settings.REDIS_HOST,
            port=settings.REDIS_PORT,
            db=settings.REDIS_DATABASES['STATS'],
        )
        pipe = r.pipeline()

        # Global and daily tallies for all URLs.
        pipe.incr('api:v3.count')
        pipe.incr('api:v3.d:%s.count' % d)

        # Use a sorted set to store the user stats, with the score representing
        # the number of queries the user made in total or on a given day.
        # (redis-py 3+ argument order: zincrby(name, amount, value).)
        pipe.zincrby('api:v3.user.counts', 1, user.pk)
        pipe.zincrby('api:v3.user.d:%s.counts' % d, 1, user.pk)

        # Use a sorted set to store all the endpoints, with the score representing
        # the number of queries the endpoint received in total or on a given day.
        pipe.zincrby('api:v3.endpoint.counts', 1, endpoint)
        pipe.zincrby('api:v3.endpoint.d:%s.counts' % d, 1, endpoint)

        pipe.execute()
Put that somewhere in your project, and then add the mixin to your various views, like so:
class ThingViewSet(LoggingMixin, viewsets.ModelViewSet):
    # More stuff here.
Some notes about the class:
- It uses Redis pipelines to make all of the Redis queries hit the server with one request instead of six.
- It uses sorted sets to keep track of which endpoints in your API are being used the most and which users are using the API the most.
- It creates a few new keys in your cache per day -- there might be better ways to do this, but I can't find any.
This should be a fairly flexible starting point for logging the API.
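Once the counters are in place, reading them back for a dashboard is just a handful of Redis calls. A rough sketch, assuming the same settings and key names as the mixin above:

import redis
from django.conf import settings

r = redis.StrictRedis(
    host=settings.REDIS_HOST,
    port=settings.REDIS_PORT,
    db=settings.REDIS_DATABASES['STATS'],
)

# All-time total across every endpoint.
total = r.get('api:v3.count')

# Top ten users and endpoints by request count, highest first.
top_users = r.zrevrange('api:v3.user.counts', 0, 9, withscores=True)
top_endpoints = r.zrevrange('api:v3.endpoint.counts', 0, 9, withscores=True)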
What I usually do is have a centralized cache server (Redis) and log all requests there, with all the custom calculations or fields I need. Then you can build your own dashboard on top of that.
OR
Go with Logstash from the Elasticsearch company. It's very well done, saves you time, and scales very well. I'd say give it a shot: http://michael.bouvy.net/blog/en/2013/11/19/collect-visualize-your-logs-logstash-elasticsearch-redis-kibana/
Related
I have a Django 1.11 app. There is a model Campaign where I can specify parameters to select users. When I run a Campaign, I create a CampaignRun instance with an FK campaign_id and an M2M field users. Each time a Campaign is run, different users can end up in the resulting queryset, so I'd like to keep a record of them. I do it as shown below:
run = CampaignRun.objects.create(campaign=self, ...)
(...)
filtered_users = User.objects.filter(email__in=used_emails)
run.users.add(*filtered_users) # I also tried run.users.set(filtered_users)
run.save()
However, it turns out that if the campaign is run from django-admin and the resulting number of users exceeds approximately 150, the process takes more than 30 seconds, which results in a 502 Bad Gateway error.
It seems to me that 150 is a ridiculously low number at which to hit a timeout, so I believe there must be plenty of room to optimize the process. What can I do to improve it? What are my options in Django? Would you suggest a completely different approach (e.g. NoSQL)?
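One common way to sidestep a gateway timeout like this is to move the run out of the request/response cycle so the admin action returns immediately. A hedged sketch, assuming Celery is configured for the project; the task name and the way used_emails is passed are illustrative, not part of the original code:

# Hedged sketch, assuming Celery is available.
# from myapp.models import Campaign, CampaignRun  (your app's models and User)
from celery import shared_task


@shared_task
def run_campaign(campaign_id, used_emails):
    campaign = Campaign.objects.get(pk=campaign_id)
    run = CampaignRun.objects.create(campaign=campaign)
    filtered_users = User.objects.filter(email__in=used_emails)
    run.users.add(*filtered_users)  # happens in the worker, not in the web request


# In the admin action, return immediately and let the worker do the work:
# run_campaign.delay(campaign.pk, used_emails)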
I've been self-studying web design and want to implement something, but I'm really not sure how to accomplish it, or even whether I can.
The only frontend I have dealt with is Angular 4, and the only backend I have dealt with is Django REST Framework. I have managed to get the user models done in DRF, authenticate users on the frontend with JSON Web Tokens, and make different kinds of GET and POST requests.
What I want to do is have a button on the frontend. When the button is clicked, it sends a GET request that runs a text-mining algorithm producing a list. The algorithm may take some time to complete, maybe in the range of 20-30 seconds, but I don't want the user to wait that long for a single response containing the fully compiled list.
Is it possible to, say, create a table in Angular, and then every couple of seconds have the backend send another response containing more data, with the new results appended to that table? Something like:
00.00s  button -> GET request
01.00s  drf starts analysis
05.00s  drf returns the first estimated 10% of the overall list
09.00s  drf finds 10% more, returns estimated 20% of the overall list
Then repeat this process until the algorithm has stopped. The list will be very small in size, probably around 20 strings of about 15 words each.
I already tried sending multiple responses in a for loop in Django, but the Angular frontend just receives the first one and then doesn't listen anymore.
No, that's not possible. Each request gets exactly one response, not multiple.
You have two options:
- Just start your algorithm with an endpoint like /start, and check its state at an interval on an endpoint like /state (see the sketch after this list).
- Read about WebSockets, or try Firebase (or AngularFire). These provide two-way communication.
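A minimal sketch of the first option, assuming the long-running work can report progress through Django's cache; the endpoint names, the job_id scheme and run_text_mining() are illustrative:

import threading
import uuid

from django.core.cache import cache
from rest_framework.decorators import api_view
from rest_framework.response import Response


@api_view(['POST'])
def start(request):
    # Kick off the work in the background and hand the client a job id.
    job_id = uuid.uuid4().hex
    payload = dict(request.data)  # copy before the request finishes
    cache.set(job_id, {'done': False, 'results': []}, timeout=600)

    def work():
        results = []
        for item in run_text_mining(payload):  # your algorithm, yielding strings
            results.append(item)
            cache.set(job_id, {'done': False, 'results': results}, timeout=600)
        cache.set(job_id, {'done': True, 'results': results}, timeout=600)

    threading.Thread(target=work, daemon=True).start()
    return Response({'job_id': job_id})


@api_view(['GET'])
def state(request, job_id):
    # Angular polls this every couple of seconds and appends the new rows.
    return Response(cache.get(job_id, {'done': False, 'results': []}))

In production you would typically hand the work to a task queue (Celery, for example) instead of a bare thread, but the polling shape on the Angular side stays the same.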
I'm building a Django application, and I would like to track certain model statistics over time (such as the number of registered users or the number of times a page has been edited). Is there an existing app that would do this for me, or would it be easier to roll one from scratch?
At the end of the day, I'm looking for something that can track unique values across different models over time.
The number of registered users is already available:
active_users = User.objects.filter(is_active=True).count()
For all users, active and inactive, it's just:
all_users = User.objects.count()
To get the number of times a page has been edited, there are a couple of approaches: you could track and record each individual change (and count them), or just override the model's save method to maintain a counter of sorts:
def save(self, *args, **kwargs):
    self.counter += 1
    super().save(*args, **kwargs)  # don't forget to call the parent save
If you want to record each individual change, use a versioning tool like django-reversion (it is under active development and has a ton of deployments).
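If you go the django-reversion route, a rough sketch of how the edit count falls out of the stored versions (the Page model is illustrative):

# Hedged sketch using django-reversion; "Page" is an illustrative model name.
import reversion
from django.db import models
from reversion.models import Version


@reversion.register()
class Page(models.Model):
    title = models.CharField(max_length=200)
    body = models.TextField()


def edit_page(page, new_body):
    # Save inside a revision block so the edit is recorded as a Version.
    with reversion.create_revision():
        page.body = new_body
        page.save()


def edit_count(page):
    # "How many times has this page been edited?"
    return Version.objects.get_for_object(page).count()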
You could use django-reversion for an audit trail.
I have a Django project based on multiple PostgreSQL servers.
I want users to be sharded across those database servers using the same sharding logic used by Instagram:
User ID => logical shard ID => physical shard ID => database server => schema => user table
The logical shard ID is directly calculated from the user ID (13 bits embedded in the user id).
The mapping from logical to physical shard ID is hard coded (in some configuration file or static table).
The mapping from physical shard ID to database server is also hard coded. Instagram uses Pgbouncer at this point to retrieve a pooled database connection to the appropriate database server.
Each logical shard lives in its own PostgreSQL schema (for those not familiar with PostgreSQL, this is not a table schema, it's rather like a namespace, similar to MySQL 'databases'). The schema is simply named something like "shardNNNN", where NNNN is the logical shard ID.
Finally, the user table in the appropriate schema is queried.
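To make that pipeline concrete, here is a rough sketch of the ID-to-database mapping, assuming Instagram's published bit layout (41 bits of timestamp, 13 bits of logical shard, 10 bits of sequence) and two hard-coded maps in settings; all names are illustrative:

# Hedged sketch of the mapping, not an established API.
from django.conf import settings


def logical_shard_for(user_id):
    # The 13 shard bits sit just above the 10 sequence bits.
    return (user_id >> 10) & 0x1FFF


def physical_shard_for(user_id):
    # Hard-coded logical -> physical map, e.g. {0: 0, 1: 0, 2: 1, ...}
    return settings.LOGICAL_TO_PHYSICAL_SHARD[logical_shard_for(user_id)]


def db_alias_for_user(user_id):
    # Hard-coded physical shard -> entry in settings.DATABASES
    return settings.PHYSICAL_SHARD_TO_DB_ALIAS[physical_shard_for(user_id)]


def schema_for(user_id):
    return 'shard%04d' % logical_shard_for(user_id)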
How can this be achieved as simply as possible in Django?
Ideally, I would love to be able to write Django code such as:
Fetching an instance
# this gets the user object on the appropriate server, in the appropriate schema:
user = User.objects.get(pk = user_id)
Fetching related objects
# this gets the user's posted articles, located in the same logical shard:
articles = user.articles
Creating an instance
# this selects a random logical shard and creates the user there:
user = User.create(name = "Arthur", title = "King")
# or:
user = User(name = "Arthur", title = "King")
user.save()
Searching users by name
# fetches all relevant users (kings) from all relevant logical shards
# - either by querying *all* database servers (not good)
# - or by querying a "name_to_user" table then querying just the
# relevant database servers.
users = User.objects.filter(title = "King")
To make things even more complex, I use Streaming Replication to replicate every database server's data to multiple slave servers. The masters should be used for writes, and the slaves should be used for reads.
Django provides support for automatic database routing, which is probably sufficient for most of the above, but I'm stuck on User.objects.get(pk = user_id), because the router does not have access to the query parameters: it does not know what the user ID is, only that the code is trying to read the User model.
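One way around that limitation (a sketch under the assumptions above, not a drop-in solution) is to skip the router for primary-key lookups and pick the database alias explicitly, for example with a custom manager method:

# Hedged sketch; ShardedUserManager and get_by_id are illustrative names,
# and db_alias_for_user() is the helper sketched earlier.
from django.db import models


class ShardedUserManager(models.Manager):
    def get_by_id(self, user_id):
        # The router never sees the pk, so choose the alias ourselves.
        return self.using(db_alias_for_user(user_id)).get(pk=user_id)

# Usage, instead of User.objects.get(pk=user_id):
# user = User.objects.get_by_id(user_id)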
I am well aware that sharding should probably be used only as a last resort optimization since it has limitations and really makes things quite complex. Most people don't need sharding: an optimized master/slave architecture can go a very long way. But let's assume I do need sharding.
In short: how can I shard data in Django, as simply as possible?
Thanks a lot for your kind help.
Note
There is an existing question which is quite similar, but IMHO it's too general and lacks precise examples. I wanted to narrow things down to a particular sharding technique I'm interested in (the Instagram way).
Mike Clarke recently gave a talk at PyPgDay on how Disqus shards their users with Django and PostgreSQL. He wrote up a blog post on how they do it.
Several strategies can be employed when sharding Postgres databases. At Disqus, we chose to shard based on table name. Whereas the original table name as generated by Django might be comments_post, our sharding tools will rewrite the SQL to query a table comments_post_X, where X is the shard ID calculated based on a consistent hashing scheme. All these tables live in a single schema, on a single database instance.
In addition, they released some code as part of a sample application demonstrating how they shard.
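As a simplified illustration of that table-name scheme (not Disqus's actual code; a plain hash-modulo stands in for their consistent-hashing step, which would minimize remapping when shards are added):

# Hedged illustration: derive a per-shard table name from a shard key.
import hashlib

NUM_SHARDS = 64  # assumed


def sharded_table(base_table, shard_key):
    digest = hashlib.md5(str(shard_key).encode()).hexdigest()
    shard_id = int(digest, 16) % NUM_SHARDS
    return '%s_%d' % (base_table, shard_id)


# comments_post -> e.g. comments_post_17 for a given key
print(sharded_table('comments_post', 123456))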
You really don't want to be in the position of asking this question. If you are sharding by user id then you probably don't want to search by name.
If you are sharding your database then it's not going to be invisible to your application and will probably end up requiring schema alterations.
You might find SkyTools useful; read up on PL/Proxy. It's how Skype shards its databases.
It is better to use professional sharding middleware, for example Apache ShardingSphere.
The project contains two products: ShardingSphere-JDBC, a Java driver, and ShardingSphere-Proxy, which works with all programming languages. It can support Python and Django as well.
I have a cronjob that runs every hour and parses 150,000+ records. Each record is summarized individually in a MySQL table. I use two web services to retrieve the user information:
- User demographics (IP, country, city, etc.)
- Phone information (whether it's a landline or a cell phone, and if a cell phone, which carrier)
Every time I process a record, I check whether I already have this information, and if not, I call these web services. After tracing my code I found out that both of these calls take 2 to 4 seconds, which makes my cronjob very slow, so I can't compile the statistics on time.
Is there a way to make these web service calls faster?
Thanks
Simple: get the data locally and use Melissa Data:
- for IP: http://w10.melissadata.com/dqt/websmart/ip-locator.htm
- for phone: http://www.melissadata.com/fonedata.html
You can also cache the results using Memcached or APC, which will make things faster since you don't have to request the data from the API or the database every time.
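If the cronjob happens to be Python/Django, a minimal sketch of that caching idea (the key format and lookup_demographics() are illustrative stand-ins):

# Hedged sketch: cache slow web-service lookups in the configured cache
# backend (e.g. Memcached). lookup_demographics() stands in for the real call.
from django.core.cache import cache


def demographics_for(ip):
    key = 'demo:%s' % ip
    data = cache.get(key)
    if data is None:
        data = lookup_demographics(ip)        # the slow 2-4 second call
        cache.set(key, data, timeout=86400)   # keep it for a day
    return data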
A couple of ideas... if the same users keep returning, caching the data in another table would be very helpful: you would only look a user up once and have the data ready when they return. On re-reading the question, it looks like you are already doing that.
Another option would be to spawn new threads to do the look-ups. This could be a new thread for each request or, if that is not feasible, you could have n service threads ready to do the look-ups and update the results; a sketch of that follows.
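A minimal sketch of the thread idea in Python; fetch_user_info() stands in for the two web-service calls:

# Hedged sketch: overlap the slow, I/O-bound look-ups with a pool of n threads.
from concurrent.futures import ThreadPoolExecutor


def enrich_records(records, workers=20):
    # Each call waits 2-4 seconds on the network, so threads overlap the
    # waiting instead of paying for it serially.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_user_info, records))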