how to get the 1 million-th click of a website - web-services

I often heard this question coming from different sources, but never got a good idea of the technologies to achieve this. Can anyone shed some lights? The question is: you have a website which has high volume of users access per day. Your website is deployed in a distributed manner, have multiple webservers and load balancers responding incoming requests from lots of locations. How do you get the 1000000th user access, and show him a special page saying "congrats, you are our 1000000th visitor!". Assuming you had a distributed backend.

You could do it with jQuery, for example:
$("#linkOfInterest").click(function() { //code for updating a variable/record that contains the current number of clicks });
CSS:
a#linkOfInterest {
//style goes here
}
somewhere in the html :
<a id="linkOfInterest" href="somepage.htm"></a>

You are going to have to trade off performance or accuracy. The simplest way to do this would be have a memcached instance keep track of your visitor counts, or some other datastore with an atomic increment operation. Since there is only a single source of truth, only 1 visitor will get the message. This will delay the loading of your page by the roundtrip to the store at minimum.
If you can't afford the delay, then you will have to trade off accuracy. A distributed data store will not be able to atomically increment the field any faster than a single instance. Every web server can read and write to a local node, but another node at another datacenter may also reach 1 million users counts before the transactions are reconciled. In that case 2 or more people may get the 1 millionth user message.
It is possible to do so after the fact. Eventually, the data store will reconcile the increments, and your application can decide on a strict ordering. However, if you have already decided that a single atomic request takes too long, then this logic will take place too late to render your page.

Related

Determine unique visitors to site

I'm creating a django website with Apache2 as the server. I need a way to determine the number of unique visitors to my website (specifically to every page in particular) in a full proof way. Unfortunately users will have high incentives to try to "game" the tracking systems so I'm trying to make it full proof.
Is there any way of doing this?
Currently I'm trying to use IP & Cookies to determine unique visitors, but this system can be easily fooled with a headless browser.
Unless it's necessary that the data be integrated into your Django database, I'd strongly recommend "outsourcing" your traffic to another provider. I'm very happy with Google Analytics.
Failing that, there's really little you can do to keep someone from gaming the system. You could limit based on IP address but then of course you run into the problem that often many unique visitors share IPs (say, via a university, organization, or work site). Cookies are very easy to clear out, so if you go that route then it's very easy to game.
One thing that's harder to get rid of is files stored in the appcache, so one possible solution that would work on modern browsers is to store a file in the appcache. You'd count the first time it was loaded in as the unique visit, and after that since it's cached they don't get counted again.
Of course, since you presumably need this to be backwards compatible then of course it leaves it open to exactly the sorts of tools which are most likely to be used for gaming the system, such as curl.
You can certainly block non-browserlike user agents, which makes it slightly more difficult if some gamers don't know about spoofing browser agent strings (which most will quickly learn).
Really, the best solution might be -- what is the outcome from a visit to a page? If it is, for example, selling a product, then don't award people who have the most page views; award the people whose hits generate the most sales. Or whatever time-consuming action someone might take at the page.
Possible solution:
If you're willing to ignore people with JavaScript disabled, you could choose to count only people who access the page and then stay on that page for a given window of time (say, 1 minute). After a given period of time, do an Ajax request back to the server. So if they tried to game by changing their cookie and loading multiple tabs at once, it wouldn't work because they'd need to have the same cookie in order to register that they'd been on that page long enough. I actually think this might work; I can't honestly see a way to game that. Basically on the server side you store a dictionary called stay_until in request.session with keys for each unique page and after 1 minute or so you run an Ajax call back to the server. If the value for stay_until[page_id] is less than or equal to the current time, then they're an active user, otherwise they're not. This means that it will take someone at least 20 minutes to generate 20 unique visitors, and so long as you make the payoff worth less than the time consumed that will be a strong disincentive.
I'd even make it more explicit: on the bottom of the page in a noscript tag, put "Your access was not counted. Turn on JavaScript to be counted" with a page that lays out the tracking process.
As HTML Requests are stateless and you have no control over the users behavior on his clientside, there is no bulletproof way.
The only way you're going to be able to track "unique" visitors in a fool-proof way is to make it contingent on some controlled factor such as a login. Anything else can and will fail to be completely accurate.

Choice of storage and caching

I hope the title is chosen well enough to ask this question.
Feel free to edit if not and please accept my apologies.
I am currently laying out an application that is interacting with the web.
Explanation of the basic flow of the program:
The user is entering a UserID into my program, which is then used to access multiple xml-files over the web:
http://example.org/user/userid/?xml=1
This file contains several ID's of products the user owns in a DRM-System. This list is then used to access stats and informations about the users interaction with the product:
http://example.org/user/appid/stats/?xml=1
This also contains links to various images which are specific to that application. And those may change at any time and need to be downloaded for display in the app.
This is where the horror starts, at least for me :D.
1.) How do I store that information on the PC of the user?
I thought about using a directory for the userid, then subfolders with the appid to cache images and the xml-files to load them on demand. I also thought about using a zipfile while using the same structure.
Or would one rather use a local db like sqlite for that?
Average Number of Applications might be around ~100-300 and stats and images per app from basically 5-700.
2.) When should I refresh the content?
The bad thing is, the website from where this data is downloaded, or rather the xmls, do not contain any timestamps when it was refreshed/changed the last time. So I would need to hash all the files and compare them in the moment the user is accessing that data, which can take an inifite amount of time, because it is webbased. Okay, there are timeouts, but I would need to block the access to the content until the data is either downloaded and processed or the timeout occurs. In both cases, the application would not be accessible for a short or maybe even long time and I want to avoid that. I could let the user do the refresh manually when he needs it, but then I hoped there are some better methods for that.
Especially with the above mentioned numbers of apps and stuff.
Thanks for reading and all of that and please feel free to ask if I forgot to explain something.
It's probably worth using a DB since it saves you messing around with file formats for structured data. Remember to delete and rebuild it from time to time (or make sure old stuff is thoroughly removed and compact it from time to time, but it's probably easier to start again, since it's just a cache).
If the web service gives you no clues when to reload, then you'll just have to decide for yourself, but do be sure to check the HTTP headers for any caching instructions as well as the XML data[*]. Decide a reasonable staleness for data (the amount of time a user spends staring at the results is a absolute minimum, since they'll see results that stale no matter what you do). Whenever you download anything, record what date/time you downloaded it. Flush old data from the cache.
To prevent long delays refreshing data, you could:
visually indicate that the data is stale, but display it anyway and replace it once you've refreshed.
allow staler data when the user has a lot of stuff visible, than you do when they're just looking at a small amount of stuff. So, you'll "do nothing" while waiting for a small amount of stuff, but not while waiting for a large amount of stuff.
run a background task that does nothing other than expiring old stuff out of the cache and reloading it. The main app always displays the best available, however old that is.
Or some combination of tactics.
[*] Come to think of it, if the web server is providing reasonable caching instructions, then it might be simplest to forget about any sort of storage or caching in your app. Just grab the XML files and display them, but grab them via a caching web proxy that you've integrated into your app. I don't know what proxies make this easy - you can compile Squid yourself (of course), but I don't know whether you can link it into another app without modifying it yourself.

Desktop App w/ Database - How to handle data retrieval?

Imagine to have a Desktop application - could be best described as record keeping where the user inserts/views the records - that relies on a DB back-end which will contain large objects' hierarchies and properties. How should data retrieval be handled?
Should all the data be loaded at start-up and stored in corresponding Classes/Structures for later manipulation or should the data be retrieved only at need, stored in mock-up Classes/Structures and then reused later instead of being asked to the DB again?
As far as I can see the former approach would require a bigger memory portion used and possible waiting time at start-up (not so bad if a splash screen is displayed), while the latter could possibly subject the user to delays during processing due to data retrieval and would require to perform some expensive queries on the database, whose results and/or supporting data structures will most probably serve no purpose once used*.
Something tells me that the solution lies on an in-depth analysis which will lead to a mixture of the two approaches listed above based on data most frequently used, but I am very interested in reading your thoughts, tips and real life experiences on the topic.
For discussion's sake, I'm thinking about C++ and SQLite.
Thanks!
*assuming that you can perform on Classes/Objects faster operations rather than have to perform complicated queries on the DB.
EDIT
Some additional details:
No concurrent access to the data, meaning only 1 user works on the data which is stored locally.
Data is sent back depending on changes made humanly - i.e. with low frequency. This is not necessarily true for reading data from the DB, where I can expect to have few peaks of lots of reads which I'd like to be fast.
What I am most afraid of is the user getting the feeling of slowness when displaying a complex record (because this has to be read in from the DB).
Use Lazy Load and Data Mapper (pg.165) patterns.
I think this question depends on too many variables to be able to give a concrete answer. What you should consider first is how much data you need to read from the database in to your application. Further, how often are you sending that data back to the database and requesting new data? Also, will users be working on the data concurrently? If so, loading the data initially is probably not a good idea.
After your edits I would say it's probably better to leave the data at the database. If you are going to be accessing it with relatively low frequency there is no reason to load up or otherwise try to cache it in your application at launch. Of course, only you know your application best and should decide what bits may be loaded up front to increase performance.
You might consider to user intermediate server (WCF) that will contain cached data from the database in memory, this way users don't have to go every time to the database. Also since it is only one access point to for all users if somebody changes/added record you can update cache as well. Static data can be reloaded every x hours (for example every hour). It still might not the best option, since data needs to be marshaled from Server to the Client, but you can use netTcp binding if you can, which is fast and small.

Windows Phone 7 - Best Practices for Speeding up Data Fetch

I have a Windows Phone 7 app that (currently) calls an OData service to get data, and throws the data into a listbox. It is horribly slow right now. The first thing I can think of is because OData returns way more data than I actually need.
What are some suggestions/best practices for speeding up the fetching of data in a Windows Phone 7 app? Anything I could be doing in the app to speed up the retrieval of data and putting into in front of the user faster?
Sounds like you've already got some clues about what to chase.
Some basic things I'd try are:
Make your HTTP requests as small as possible - if possible, only fetch the entities and fields you absolutely need.
Consider using multiple HTTP requests to fetch the data incrementally instead of fetching everything in one go (this can, of course, actually make the app slower, but generally makes the app feel faster)
For large text transfers, make sure that the content is being zipped for transfer (this should happen at the HTTP level)
Be careful that the XAML rendering the data isn't too bloated - large XAML structure repeated in a list can cause slowness.
When optimising, never assume you know where the speed problem is - always measure first!
Be careful when inserting images into a list - the MS MarketPlace app often seems to stutter on my phone - and I think this is caused by the image fetch and render process.
In addition to Stuart's great list, also consider the format of the data that's sent.
Check out this blog post by Rob Tiffany. It discusses performance based on data formats. It was written specifically with WCF in mind but the points still apply.
As an extension to the Stuart's list:
In fact there are 3 areas - communication, parsing, UI. Measure them separately:
Do just the communication with the processing switched off.
Measure parsing of fixed ODATA-formatted string.
Whether you believe or not it can be also the UI.
For example a bad usage of ProgressBar can result in dramatical decrease of the processing speed. (In general you should not use any UI animations as explained here.)
Also, make sure that the UI processing does not block the data communication.

Instant search considerations

I've started working on a basic instant search tool.
This is a workflow draft.
User presses a key
Current value gets passed to the function which will make an Ajax call to a web service
Web service will run a select on a database through LINQ-To-SQL and will retrieve a list of values that match my value. I will achieve this by using SQL Like clause
Web service will return data to the function.
Function will populate relative controls through jQuery.
I have the following concerns/considerations:
Problem: Fast typists: I have typed in this sentence within few seconds. This means that on each key press I will send a request to a database. I may have 10 people doing the same thing. Server may return a list of 5 records, or it may return a list of 1000 records. Also I can hold down a key and this will send few hundred requests to a database - this can potentially slow the whole system down.
Possible solutions:
Timer where I will be able to send a request to database once every 2-4 seconds
Do not return any data unless the value is at least 3 characters long
Return a limited number of rows?
Problem: I'm not sure whether LINQ-to-SQL will cope with the potential load.
Solution: I can use stored procedures, but is there any other feasible alternatives?
I'm interested to hear if anybody else is working on a similar project and what things you have considered before implementing it.
Thank you
When to call the web service
You should only call the web service when the user is interested in suggestions. The user will only type fast if he knows what to type. So while he's typing fast, you don't have to provide suggestions to the user.
When a fast typist pauses for a short time, then he's probably interested in search suggestions. That's when you call the web service to retrieve suggestions.
Slow typists will always benefit from search suggestions, because it can save them time typing in the query. In this case you will always have short pauses between the keystrokes. Again, these short pauses are your queue to retrieve suggestions from the web service.
You can use the setTimeout function to call your web service 500 milliseconds after the user has pressed a key. If the user presses a key, you can reset the timeout using clearTimeout. This will result in a call to the web service only when the user is idle for half a second.
Performance of LINQ-to-SQL
If your query isn't too complex, LINQ-to-SQL will probably perform just fine.
To improve performance, you can limit the number of suggestions to about twenty. Most users aren't interested in thousands of suggestions anyway.
Consider using a full text catalog instead of the like clause if you are searching through blocks of text to find specific keywords. Besides being much faster, it can be configured to recognize multiple forms of the same word (like mouse and mice or leaf and leaves).
To really make your search shine, you can correct many common misspellings using the levenshtein distance to compare the search term to a list of similar terms when no matches are found.