Tracing requests of users by logging their actions to DB in django - django

I want to trace user's actions in my web site by logging their requests to database as plain text in Django.
I consider to write a custom decorator and place it to every view that I want to trace.
However, I have some troubles in my design.
First of all, is such logging mecahinsm reasonable or because of my log table will be enlarging rapidly it causes some preformance problems ?
Secondly, how should be my log table's design ?
I want to keep keywords if the user call search view or keep the item's id if the user call details of item view.
Besides, IP addresses of user's should be kept but how can I seperate users if they connect via single IP address as in many companies.
I am glad to explain in detail if you think my question is unclear.
Thanks

I wouldn't do that. If this is a production service then you've got a proper web server running in front of it, right? Apache, or nginx or something. That can do logging, and can do it well, and can write to a form that won't bloat your database, and there's a wealth of analytical tools for log analysis.
You are going to have to duplicate a lot of that functionality in your decorator, such as when you want to switch it on or off, or change the log level. The only thing you'll get by doing it all in django is the possibility of ultra-fine control, such as only logging views of blog posts with id numbers greater than X or something. But generally you'd not want that level of detail, and you'd log everything and do any stripping at the analysis phase. You've not given any reason currently why you need to do it from Django.
If you really want it in a RDBMS, reading an apache log file into Postgres or MySQL or one of those expensive ones is fairly trivial.

One thing you should keep in mind is that SQL databases don't offer you a very good writing performance (in comparison with reading), so if you are experiencing heavy loads you should probably look for a better in-memory solution (eg. some key-value-store like redis).
But keep in mind, that, especially if you would use a non-sql solution you should be aware what you want to do with the collected data (just display something like a 'log' or do some more in-deep searching/querying on the data).
If you want to identify different users from the same IP address you should probably look for a cookie-based solution (if you are using django's session framework the session's are per default identified through a cookie - so you could just simply use sessions). Another solution could be doing the logging 'asynchronously' via javascript after the page has loaded in the browser (which could give you more possibilities in identifying the user and avoid additional load when generating the page).

Related

Django - How to store all the requests/responses with the least overhead?

I'm working on a Django middleware to store all requests/responses in my main database (Postgres / SQLite).
But it's not hard to guess that the overhead will be crazy, so I'm thinking to use Redis to queue the requests for an amount of time and then send them slowly to my database.
e.g. receiving 100 requests, storing them in database, waiting to receive another 100 requests and doing the same, or something like this.
The model is like this:
url
method
status
user
remote_ip
referer
user_agent
user_ip
metadata # any important piece of data related to request/response e.g. errors or ...
created_at
updated_at
My questions are "is it a good approach? and how we can implement it? do you have any example that does such a thing?"
And the other question is that "is there any better solution"?
This doesn't suit the concrete question/answer format particularly well, unfortunately.
"Is this a good approach" is difficult to answer directly with a yes or no response. It will work and your proposed implementation looks sound, but you'll be implementing quite a bit of software and adding quite a bit of complexity to your project.
Whether this is desirable isn't easily answerable without context only you have.
Some things you'll want to answer:
What am I doing with these stored requests? Debugging? Providing an audit trail?
If it's for debugging, what does a database record get us that our web server's request logs do not?
If it's for an audit trail, is every individual HTTP request the best representation of that audit trail? Does an auditor care that someone asked for /favicon.ico? Does it convey the meaning and context they need?
Do we absolutely need every request stored? For how long? How do we handle going over our storage budget? How do we handle in edge cases like the client hanging up before getting the response, or we've processed the request but crashed before sending a response or logging the record?
Does logging a request in band with the request itself present a performance cost we actually can't afford?
Compare the strengths and weaknesses of your approach to some alternatives:
We can rely on the web server's logs, which we're already paying the cost for and are built to handle many of the oddball cases here.
We can write an HTTPLog model in band with the request using a simple middleware function, which solves some complexities like "what if redis is down but django and the database aren't?"
We write an audit logging system by providing any context needed to an out-of-band process (perhaps through signals or redis+celery)
Above all: capture your actual requirements first, implement the simplest solution that works second, and optimize only after you actually see performance issues.
I would not put this functionality in my Django application. There are many tools to do that. One of them is NGINX, which is a reverse proxy server which you can put infront of Django. Then you can use the access log from NGINX. Also, you can format those logs according to your need. Usually for this big amount of data, it is better to not store them in database, because this data will rarely be used. You can store them in a S3 bucket or just in plain files and use a log parser tool to parse them.

Advice on caching for Django/Postgres application

I am building a Django web application and I'd like some advice on caching. I know very little about caching. I've read the caching chapter in the Django book, but am struggling to relate it to my real-world situation.
My application will be a web front-end on a Postgres database containing a largeish amount of data (150GB of server logs).
The database is read-only: the purpose of the application is to give users a simple way to query the data. For example, the user might ask for all rows from server X between dates A and B.
So my database needs to support very fast read operations, but it doesn't need to worry about write operations (much - I'll add new data once every few months, and it doesn't matter how long that takes).
It would be nice if clients making the same request could use a cache, rather than making another call to the Postgres database.
But I don't know what sort of cache I should be looking at: a web cache, or a database cache. Or even if Postgres is the best choice (I'd just like to use it because it works well with Django, and is so robust). Could anyone advise?
The Django book says memcached is the best cache with Django, but it runs in memory, and the results of some of these queries could be several GB, so memcached might fill up the machine's memory quickly. But perhaps I don't fully understand how memcached operates.
Your query should in no way return several GB of data. There's no practical reason to do so, as the user cannot absorb that much data at a time. Your result set should be paged, such that the user sees only 10, 25, whatever results at a time. That then allows you to also limit your query to only fetch 10, 25, whatever records at a time starting from a particular index based on the page number.
Caching search result pages is not a particularly good idea, regardless, though. For one, the odds that different users will ever conduct exactly the same search are pretty minimal, and you'll end up wasting RAM to cache result sets that will never be used again. Also, something like logs should be real-time. If you return a cached result set, there might be new, relevant results that are not included, obscuring the usefulness of your search.
As mentioned above you have limitations on what problems caching can solve. As you are building this application, then I see no reason why you couldn't just plug in Django Haystack and Whoosh and see how it performs, then switching to some of the other more Enterprise search backends is a breeze.

Can Datomic simplify querying data contained in dynamically accessed HTML documents?

I need to write an API which would provide access to data being served as HTML documents from a web server. I need for my users to be able to perform queries over the data.
Say on a web site there is a page which lists items and their owners. Then there is additional set of profile pages for owners which for each owner provide information about their reputation. An example query I may need to answer is "Give me ID's and owners of all items submitted in 2013 whose owners have reputation of at least 10".
Given a query to answer, I need to be able to screen scrape only the parts of the web site I need for answering the query at hand. And ideally cache the obtained information for future use with new queries.
I have no problem writing the screen scraping part, but I am struggling with designing the storage/query/cache part. Is there something about Clojure/Datomic that makes it an especially suitable technology choice for this kind of processing of data? I have been pointed in this direction before.
It seems a nice challenge but not sure about a few things: a) would you like to expose to your users a Datalog query box and so make them learn datalog-like syntax? b) what exact kind of results do you wish to cache, raw DB responses, html fomatted text, json ?
Anyway I suggest you to install and play a little bit with the Datomic console to get a grasp if you didn't before as it seems to me the more close idea to what you want to achieve atm https://www.youtube.com/watch?v=jyuBnl0XQ6s http://blog.datomic.com/2013/10/datomic-console.html
For the API I suggest you to use http://clojure-liberator.github.io/liberator/ as it provides sane defaults to implement REST services and let you focus on your app behaviour

Is it feasible to have a single sign on page for multiple datasources?

I am in the beginning stages of planning a web application using ColdFusion and SQL Server 2012.
In researching the pros and cons of using multiple databases (one per customer) vs one large database, for my purposes I have decided multiple databases would be the best approach.
With this in mind I am now wondering the best way to proceed regarding logging clients in. I have two thoughts here:
I could use sub-domains with each one being for a specific client. The sub-domain also being the datasource name.
I could have a single sign on page with the datasource for this client stored in a universal users table.
I like the idea of option 2 best however I am wondering how this may work in the real world. Making each user unique would not be ideal (although I suppose I could make this off of an email address instead of a username).
I was thinking of maybe adding something along the lines of a "company code" that would need to be entered along with the username and password.
I feel like this may be asking too much of clients though.
With all of this said, would you advise going with option 1 or option 2? Would also love to hear any thoughts or ideas that may differ.
Thanks!
If you are expecting to have a large amount of data per client, it may be a good idea to split each client into their own database.
You can create a global database that contains client information, client datasource, settings, etc. for each client and then set the client database in the application.cfc.
This also makes it easier at the end if a client request their data or you would like to remove a client from the system.

comparison of ways to maintain state

There are various ways to maintain user state using in web development.
These are the ones that I can think of right now:
Query String
Cookies
Form Methods (Get and Post)
Viewstate (ASP.NET only I guess)
Session (InProc Web server)
Session (Dedicated web server)
Session (Database)
Local Persistence (Google Gears) (thanks Steve Moyer)
etc.
I know that each method has its own advantages and disadvantages like cookies not being secure and QueryString having a length limit and being plain ugly to look at! ;)
But, when designing a web application I am always confused as to what methods to use for what application or what methods to avoid.
What I would like to know is what method(s) do you generally use and would recommend or more interestingly which of these methods would you like to avoid in certain scenarios and why?
While this is a very complicated question to answer, I have a few quick-bite things I think about when considering implementing state.
Query string state is only useful for the most basic tasks -- e.g., maintaining the position of a user within a wizard, perhaps, or providing a path to redirect the user to after they complete a given task (e.g., logging in). Otherwise, query string state is horribly insecure, difficult to implement, and in order to do it justice, it needs to be tied to some server-side state machine by containing a key to tie the client to the server's maintained state for that client.
Cookie state is more or less the same -- it's just fancier than query string state. But it's still totally maintained on the client side unless the data in the cookie is a key to tie the client to some server-side state machine.
Form method state is again similar -- it's useful for hiding fields that tie a given form to some bit of data on the back end (e.g., "this user is editing record #512, so the form will contain a hidden input with the value 512"). It's not useful for much else, and again, is just another implementation of the same idea behind query string and cookie state.
Session state (any of the ways you describe) are all great, since they're infinitely extensible and can handle anything your chosen programming language can handle. The first caveat is that there needs to be a key in the client's hand to tie that client to its state being stored on the server; this is where most web frameworks provide either a cookie-based or query string-based key back to the client. (Almost every modern one uses cookies, but falls back on query strings if cookies aren't enabled.) The second caveat is that you need to put some though into how you're storing your state... will you put it in a database? Does your web framework handle it entirely for you? Again, most modern web frameworks take the work out of this, and for me to go about implementing my own state machine, I need a very good reason... otherwise, I'm likely to create security holes and functionality breakage that's been hashed out over time in any of the mature frameworks.
So I guess I can't really imagine not wanting to use session-based state for anything but the most trivial reason.
Security is also an issue; values in the query string or form fields can be trivially changed by the user. User authentication should be saved either in an encrypted or tamper-evident cookie or in the server-side session. Keeping track of values passed in a form as a user completes a process, like a site sign-up, well, that can probably be kept in hidden form fields.
The nice (and sometimes dangerous) thing, though, about the query string is that the state can be picked up by anyone who clicks on a link. As mentioned above, this is dangerous if it gives the user some authorization they shouldn't have. It's nice, though, for showing your friends something you found on the site.
With the increasing use of Web 2.0, I think there are two important methods missing from your list:
8 AJAX applications - since the page doesn't reload and there is no page to page navigation, state isn't an issue (but persisting user data must use the asynchronous XML calls).
9 Local persistence - Browser-based applications can persist their user data and state to the local hard drive using libraries such as Google Gears.
As for which one is best, I think they all have their place, but the Query String method is problematic for search engines.
Personally, since almost all of my web development is in PHP, I use PHP's session handlers.
Sessions are the most flexible, in my experience: they're normally faster than db accesses, and the cookies they generate die when the browser closes (by default).
Avoid InProc if you plan to host your website on a cheap-n-cheerful host like webhost4life. I've learnt the hard way that because their systems are over subscribed, they recycle the applications very frequently which causes your session to get lost. Very annoying.
Their suggestion is to use StateServer which is fine except you have to serialise/deserialise the session eash post back. I love objects and my web app is full of them. I'm concerned about performance when switching to StateServer. I need to refactor to only put the stuff I really need in the session.
Wish I'd know that before I started...
Cheers, Rob.
Be careful what state you store client side (query strings, form fields, cookies). Anything security-related should not be stored client-side, except maybe a session identifier if it is reasonably obscured and hard to guess. There are too many websites that have settings like "authenticated=true" and store those in a cookie or query string or hidden form field. It is trivial for a user to bypass something like that. Remember that ANY input coming from a client could have been tampered with and should not be trusted.
Signed Cookies linked to some sort of database store when you need to grab data. There's no reason to be storing data on the client side if you have a connected back-end; you're just looking for trouble if this is a public facing website.
It's not some much a question of what to use & what to avoid, but when to use which. Each has a particular circumstances when it is the best, and a different circumstance when it's the worst.
The deciding factor is generally lifetime of the data. Session state lives longer than form fields, and so on.