Determine unique visitors to site - django

I'm creating a django website with Apache2 as the server. I need a way to determine the number of unique visitors to my website (specifically to every page in particular) in a full proof way. Unfortunately users will have high incentives to try to "game" the tracking systems so I'm trying to make it full proof.
Is there any way of doing this?
Currently I'm trying to use IP & Cookies to determine unique visitors, but this system can be easily fooled with a headless browser.

Unless it's necessary that the data be integrated into your Django database, I'd strongly recommend "outsourcing" your traffic to another provider. I'm very happy with Google Analytics.
Failing that, there's really little you can do to keep someone from gaming the system. You could limit based on IP address but then of course you run into the problem that often many unique visitors share IPs (say, via a university, organization, or work site). Cookies are very easy to clear out, so if you go that route then it's very easy to game.
One thing that's harder to get rid of is files stored in the appcache, so one possible solution that would work on modern browsers is to store a file in the appcache. You'd count the first time it was loaded in as the unique visit, and after that since it's cached they don't get counted again.
Of course, since you presumably need this to be backwards compatible then of course it leaves it open to exactly the sorts of tools which are most likely to be used for gaming the system, such as curl.
You can certainly block non-browserlike user agents, which makes it slightly more difficult if some gamers don't know about spoofing browser agent strings (which most will quickly learn).
Really, the best solution might be -- what is the outcome from a visit to a page? If it is, for example, selling a product, then don't award people who have the most page views; award the people whose hits generate the most sales. Or whatever time-consuming action someone might take at the page.
Possible solution:
If you're willing to ignore people with JavaScript disabled, you could choose to count only people who access the page and then stay on that page for a given window of time (say, 1 minute). After a given period of time, do an Ajax request back to the server. So if they tried to game by changing their cookie and loading multiple tabs at once, it wouldn't work because they'd need to have the same cookie in order to register that they'd been on that page long enough. I actually think this might work; I can't honestly see a way to game that. Basically on the server side you store a dictionary called stay_until in request.session with keys for each unique page and after 1 minute or so you run an Ajax call back to the server. If the value for stay_until[page_id] is less than or equal to the current time, then they're an active user, otherwise they're not. This means that it will take someone at least 20 minutes to generate 20 unique visitors, and so long as you make the payoff worth less than the time consumed that will be a strong disincentive.
I'd even make it more explicit: on the bottom of the page in a noscript tag, put "Your access was not counted. Turn on JavaScript to be counted" with a page that lays out the tracking process.

As HTML Requests are stateless and you have no control over the users behavior on his clientside, there is no bulletproof way.

The only way you're going to be able to track "unique" visitors in a fool-proof way is to make it contingent on some controlled factor such as a login. Anything else can and will fail to be completely accurate.

Related

Passing more than 8000 Post parameters throws error

I am working on a module which requires to submit a form with an insane amount of parameters (8k-10k). I am not sure whether this is a good idea or not. But that's the way it is. I have changed the settings in neo-runtime.xml file as mentioned in this link as bellow:
<var name='postParametersLimit'><number>10000.0</number></var> and restarted the server. But no use. CF still throws error 500. We can not see any robust information. I am working on CF9.0.2 and we are using IIS 7.5. Is there anything do i need to do?
"We gave our client a dynamic form where he can add his own form fields and now we have this problem. There was a mismatch between clients expectations and our thinking of the way client wants it."
Unfortunately, you're going to have to tell the client they can't have it how they want it. That post processing limit is there for security reasons and if you raise it too high, then you're re-opening your server to a denial of service attack using a hash algorithm collision.
We have tens of thousands of forms in our workflow system and work with banking and government clients. Once this update was applied (in development first), we had to raise the default to a certain value and stick with it. We made sure to note this limitation to the entire business team and add it to our coding standards document to ensure that all new development was done in accordance to the standard. After reworking a handful of existing forms to account for the limitation, we were able to push the security update to production without a problem.
Just tell them that there is a security restriction on the number of fields in a single form and they cannot cross that line. If you need to gather that much data, they'll have to break it up into multiple forms.
You can use a cfgrid instead of using a long form with huge amount of data to take input from the user.
cfgrid allows you to load only a limited amount of data from the database.
Using it you can prevent posting and loading of huge amount of data at once.
And if you are not a great supporter of cfgrid of cfajax features you can still use pagination or stuff like that, that will allow you to load a limited amount of data in your form and in turn less posting of data. But the later will need you to build a logic by yourself.
Start with the CF server limits first. This blog post should give you a pointer to where limits can be adjusted:
http://www.cutterscrossing.com/index.cfm/2012/3/27/ColdFusion-Security-Hotfix-and-Big-Forms

Can the _ga cookie value be used to exclude self traffic in Google analytics universal?

Google analytics store a unique user id in a cookie names _ga. Some self traffic was already counted, and I was wondering if there's a way to filter it out by providing the _ga cookie value to some exclusion filter.
Any ideas?
Firstly, I'm gonna put it out there that there is no solution for excluding or removing historical data, except to make a filter or segment for your reports, which doesn't remove or prevent that data from showing up; it simply hides it. So if you're looking for something that gets rid of the data that is already there, sorry, not happening. Now on to making sure more data doesn't show up.
GA does not offer a way to exclude traffic by its visitor cookie (or any cookie in general). In order to do this, you will need to read the cookie yourself and expose it to something that GA can exclude by. For example, you can pop a custom variable or override/append the page name.
But this isn't really that convenient for lots of reasons, such as having to burn a custom variable slot, or having to write some server-side or client-side code to read the cookie and act on a value, etc..
And even if you do decide to do this, you're going to have to consider at least 2 scenarios in which this won't work or break:
1) This won't work if you go to your site from a different browser, since browsers can't read each other's cookies.
2) It will break as soon as you clear your cookies.
An alternative you should consider is to make an exclusion filter on your IP address. This has the benefit of:
works regardless of which browser you are on
you don't have to write any code for it that burns or overwrites any variables
you don't have to care about the cookie
General Notes
I don't presume to know your situation or knowledge, so take this for what it's worth: simply throwing out general advice because nothing goes without saying.
If you were on your own site to QA something and are wanting to remove data from some kind of development or QA efforts, a better solution in general is to have a separate environment for developing and QAing stuff. For example a separate dev.yoursite.com subdomain that mirrors live. Then you can easily make an exclusion on that subdomain (or have a separate view or property or any number of ways to keep that dev/qa traffic out).
Another thing is.. how much traffic are we talking here anyway? You really shouldn't be worrying about a few hits and whatnot that you've personally made on your live site. Again, I don't know your full situation and what your normal numbers look like, but in the grand scheme of things, a few extra hits is a drop of water in a bucket, and in general it's more useful to look at trends in the data over time, not exact numbers at a given point in time.

how to get the 1 million-th click of a website

I often heard this question coming from different sources, but never got a good idea of the technologies to achieve this. Can anyone shed some lights? The question is: you have a website which has high volume of users access per day. Your website is deployed in a distributed manner, have multiple webservers and load balancers responding incoming requests from lots of locations. How do you get the 1000000th user access, and show him a special page saying "congrats, you are our 1000000th visitor!". Assuming you had a distributed backend.
You could do it with jQuery, for example:
$("#linkOfInterest").click(function() { //code for updating a variable/record that contains the current number of clicks });
CSS:
a#linkOfInterest {
//style goes here
}
somewhere in the html :
<a id="linkOfInterest" href="somepage.htm"></a>
You are going to have to trade off performance or accuracy. The simplest way to do this would be have a memcached instance keep track of your visitor counts, or some other datastore with an atomic increment operation. Since there is only a single source of truth, only 1 visitor will get the message. This will delay the loading of your page by the roundtrip to the store at minimum.
If you can't afford the delay, then you will have to trade off accuracy. A distributed data store will not be able to atomically increment the field any faster than a single instance. Every web server can read and write to a local node, but another node at another datacenter may also reach 1 million users counts before the transactions are reconciled. In that case 2 or more people may get the 1 millionth user message.
It is possible to do so after the fact. Eventually, the data store will reconcile the increments, and your application can decide on a strict ordering. However, if you have already decided that a single atomic request takes too long, then this logic will take place too late to render your page.

Choice of storage and caching

I hope the title is chosen well enough to ask this question.
Feel free to edit if not and please accept my apologies.
I am currently laying out an application that is interacting with the web.
Explanation of the basic flow of the program:
The user is entering a UserID into my program, which is then used to access multiple xml-files over the web:
http://example.org/user/userid/?xml=1
This file contains several ID's of products the user owns in a DRM-System. This list is then used to access stats and informations about the users interaction with the product:
http://example.org/user/appid/stats/?xml=1
This also contains links to various images which are specific to that application. And those may change at any time and need to be downloaded for display in the app.
This is where the horror starts, at least for me :D.
1.) How do I store that information on the PC of the user?
I thought about using a directory for the userid, then subfolders with the appid to cache images and the xml-files to load them on demand. I also thought about using a zipfile while using the same structure.
Or would one rather use a local db like sqlite for that?
Average Number of Applications might be around ~100-300 and stats and images per app from basically 5-700.
2.) When should I refresh the content?
The bad thing is, the website from where this data is downloaded, or rather the xmls, do not contain any timestamps when it was refreshed/changed the last time. So I would need to hash all the files and compare them in the moment the user is accessing that data, which can take an inifite amount of time, because it is webbased. Okay, there are timeouts, but I would need to block the access to the content until the data is either downloaded and processed or the timeout occurs. In both cases, the application would not be accessible for a short or maybe even long time and I want to avoid that. I could let the user do the refresh manually when he needs it, but then I hoped there are some better methods for that.
Especially with the above mentioned numbers of apps and stuff.
Thanks for reading and all of that and please feel free to ask if I forgot to explain something.
It's probably worth using a DB since it saves you messing around with file formats for structured data. Remember to delete and rebuild it from time to time (or make sure old stuff is thoroughly removed and compact it from time to time, but it's probably easier to start again, since it's just a cache).
If the web service gives you no clues when to reload, then you'll just have to decide for yourself, but do be sure to check the HTTP headers for any caching instructions as well as the XML data[*]. Decide a reasonable staleness for data (the amount of time a user spends staring at the results is a absolute minimum, since they'll see results that stale no matter what you do). Whenever you download anything, record what date/time you downloaded it. Flush old data from the cache.
To prevent long delays refreshing data, you could:
visually indicate that the data is stale, but display it anyway and replace it once you've refreshed.
allow staler data when the user has a lot of stuff visible, than you do when they're just looking at a small amount of stuff. So, you'll "do nothing" while waiting for a small amount of stuff, but not while waiting for a large amount of stuff.
run a background task that does nothing other than expiring old stuff out of the cache and reloading it. The main app always displays the best available, however old that is.
Or some combination of tactics.
[*] Come to think of it, if the web server is providing reasonable caching instructions, then it might be simplest to forget about any sort of storage or caching in your app. Just grab the XML files and display them, but grab them via a caching web proxy that you've integrated into your app. I don't know what proxies make this easy - you can compile Squid yourself (of course), but I don't know whether you can link it into another app without modifying it yourself.

How to encourage non-anonymous editing on MediaWiki?

Problem
At work we have a department wiki (running Mediawiki). Unfortunately several
persons edit without logging in, and that makes it very difficult to track
down editors to ask questions about the content.
There are two strategies to improve this
encourage logged in editing
discourage anonymous editing.
Encouraging
For this part, any tips are welcome. But of course there is always risks involved
in rewarding behaviours.
Discourage
I know that this must be kept low or else it will discourage any editing.
But something just slightly annoying would be nice to have.
[update]
I know it is possible to just disallow anonymous editing, but that will put a high barrier to any first time contribution (especially for people outside our department!), so I do not think that is an option.
[/update]
[update2]
Using LDAP or Active Directory does not solve the problem since the wiki is also accessible and used by external contractors.
[/update2]
[update3]
I am no longer working for this company. That does not mean that I completely have lost interest in this question, but from my current interest point the most valuable part is the "Did you forget to log in?" part below, and I will accept answers based on this part of the question.
[/update3]
Confirmation
One thought was to have an additional confirmation step for anonymous users -
"Are you really sure you want to submit this anonymously?", although with
such a question there is a risk that people will give up or resist editing. However,
if that question is re-phrased in a more diplomatic way as "Did you forget
to log in?" I think it will appear as much more acceptable. And besides that
will also capture those situations where the author did in fact forget to
log in, but actually would want to have his/her contributions credited
his/her user. This last point is by itself a good enough reason for wanting it.
Is this possible?
Delay
Another thought for something to be slightly annoying is to add an extra
forced delay after "save page" displaying something like "If you had logged
in you would not have to wait x seconds". Selecting a right x is difficult
because if it is to high it will be a barrier and if it too low might not
make any difference. But then I started thinking, what about starting at
zero and then add one second delay for each anonymous edit by a given IP
address in a given time frame? That way there will be no barrier for
starting to use the wiki, and by the time the delay is getting significant
the user has already contributed a lot so I think the outcome is much
more likely to be that the editor eventually creates a user rather than
giving up. This assumes IP addresses are rather static, but that is very
typically is the case in a business network.
Is this possible?
You can Turn off Anonymous Editing in Mediawiki like so:
Edit LocalSettings.php and add the following setting:
$wgDisableAnonEdit = true;
Edit includes/SkinTemplate.php, find $fname-edit and change the code to look like this (i.e., basically wrap the following code between the wfProfileIn() and wfProfileOut() functions):
wfProfileIn( "$fname-edit" );
global $wgDisableAnonEdit;
if ( $wgUser->mId || !$wgDisableAnonEdit) {
// Leave this as is
}
wfProfileOut( "$fname-edit" );
Next, you may want to disable the [Edit] links on sections. To do this, open includes/Skin.php and search for editsection. You will see something like:
if (!$wgUser->getOption( 'editsection' ) ) {
Change that to:
global $wgDisableAnonEdit;
if (!$wgUser->getOption( 'editsection' ) || !$wgDisableAnonEdit ) {
Section editing is now blocked for anonymous users.
Forbid anonymous editing and let people log in using their domain logins (LDAP). Often the threshold is the registering of a new user and making up username and password and such.
I think you should discourage anonymous edits by forbidding them - it's an internal wiki, after all.
The flipside is you must make the login process as easy as possible. Hopefully you can configure the login cookie to have a decent length (like 1 month) so they only need to login once per month.
Play to the people's egos, and add a rep system kind of like here. Just make a widget for the home page that shows the number of edits made by the top 5 users or something. Give the top 1 or 2 users a MVP reward at regular (monthly?) intervals.
Well, I doubt that this solution will be valuable for hlovdal, given that this question is now two months old, but maybe somebody else will find it useful:
The optimum solution to this problem is to enable automatic logins. This requires two steps. First, you need to add automatic authentication to your web service. Right now, we're using Apache with the Debian usn-libapache2-authenntlm-perl package on our internal application server*. (Our network is Active Directory and, obviously, the server runs on Debian Linux.) Second, you need a MediaWiki extension that makes MediaWiki aware of the web service's authentication. I've used the Automatic REMOTE_USER Authentication module successfully on an Apache web server that was tied into our network via an NTLM authentication module, but I do recall that it required a bit of massaging the code to make it work:
I had to follow the "horrid hacks" given on the extension's page, changing the setPassword() and addUser() functions to always return true instead of always returning false.
Since Active Directory is case-insensitive and MediaWiki isn't, I replaced both instances of the statement $username = $_SERVER['REMOTE_USER'] with $username = getCanonicalName($_SERVER['REMOTE_USER']).
Since I wanted to only allow certain people within the company to use our wiki, I set autoCreate() to always return false. It doesn't sound as if you need to worry about this, so you should leave autoCreate() at always returning true, which means that anybody on your company network will be able to access the wiki.
The nifty thing about this solution is that nobody has to log in into the wiki, ever; they simply go to a wiki page and they are logged in under their network ID.
* We just switched to this from a Red Hat server that was using mod_ntlm. Unfortunately, mod_ntlm hasn't been updated in a while and it's been starting to sporadically fail. I mention this because I've started to stumble on a performance issue with our current MediaWiki configuration that may require further code massaging....
Make sure users don't get logged out if they look away from the screen or sneeze or scratch their head. You want long, persistent, sessions. Once logged in, stay logged in.
That's the problem with the MediaWiki our company is using internally - you log in, do stuff, then come back later and it logged you out, but the notification of not being logged in anymore is so insignificant on the screen that the user never notices.
If this runs within an internal network, you could pull Active Directory information so that no one has to log in, ever. That's how I do it at work. That is, if they are logged into their windows machine, then my webapps can pick up their username and associate that (or their userid) with their edits.
I don't know if this would be easy to add to MediaWiki, though.
I'd recommend checking out wikipatterns.org - a great site about the social aspects of wikis
Explicitly using some form of directory service (LDAP) would probably be a good idea, so that your users are always fully identified. On the other hand, wikis are subject to their own dynamics, in fact some wikis are so successful because they can be anonymously edited, so that's another thing to keep in mind.
Apart from that, personally I'd try to create some sort of incentive for users to contribute openly and identifiable: this could be based on a point/score system so that there are stats shown for all users who have contributed to the wiki each day, this could possibly even create some sort of competition.
Likewise, the wiki could by default not show any anonymously contributed contents without them being reviewed first, which would be another incentive for users to contribute openly.
SO has an extremely low barrier for posting. You could allow people to specify their name when making an edit. When they are ready, they can finally log in to avoid having to type their name all the time.
You said this is in a departmental situation. Can't you add a feature to the wiki where it makes an educated guess as to who is editing based on the IP address, and annotates the edit accordingly?
I agree absolutely with everyone who recommends carefully researching the effects of anonymity in your application before you start "forbidding" it. In a great many cases people prefer anonymous editing because they DO NOT WANT TO BE ASKED ABOUT IT, IDENTIFIED WITH IT, OR SUFFER SOME PROBLEM FOR POINTING IT OUT. You need to be VERY sure these factors are not driving users to prefer anonymous edits, and frankly you should continue to allow anonymized edits with a generic credential login like "anonymous_employee" or "anonymous_contractor", in case someone wants to point out an issue without becoming identified with it.
Re the "thought... to have an additional confirmation step for anonymous users- "Are you really sure you want to submit this anonymously?", it's a good idea, but do not "re-phrase" in a way that suggests it is wrong to not be logged in as yourself, i.e. don't say "Did you forget to log in?" I'd instead note it this way:
"Your edit will appear as an IP number - it may be attributed to 'anonymous_employee' or 'anonymous_contractor' or 'anonymous_contributor' for your privacy protection. You will not be notified of any answer or response to it. If you prefer to have this contribution credited, then [log in right now]."
That leaves it absolutely clear what will happen, doesn't pressure anyone to do it either way, and does not bias what is being contributed with some "rewards".
You can also, alternately, force a login via LDAP / cookies, and then ask them if they prefer this edit to be anonymous. That is the approach taken on some blog platforms. In an intranet the abuse potential for this is basically zero, so you would presumably only have situations where someone didn't want 'how they knew' or 'why they raised this' to be the question rather than the data itself... IBM has shown in some careful research that anonymized feedback is very much more useful than attributed in correcting groupthink & management blind sides.