I've started working on a basic instant search tool.
This is a workflow draft.
1. The user presses a key.
2. The current value is passed to a function, which makes an Ajax call to a web service.
3. The web service runs a select against the database through LINQ-to-SQL and retrieves a list of values that match the input. I will achieve this with the SQL LIKE clause.
4. The web service returns the data to the function.
5. The function populates the relevant controls through jQuery.
I have the following concerns/considerations:
Problem: fast typists. I typed this sentence within a few seconds, which means a request would hit the database on every key press. I may have 10 people doing the same thing. The server may return a list of 5 records, or it may return a list of 1000. I can also hold down a key and fire a few hundred requests at the database - this could potentially slow the whole system down.
Possible solutions:
A timer, so that a request is sent to the database at most once every 2-4 seconds
Do not return any data unless the value is at least 3 characters long
Return a limited number of rows?
Problem: I'm not sure whether LINQ-to-SQL will cope with the potential load.
Solution: I can use stored procedures, but are there any other feasible alternatives?
I'm interested to hear whether anybody else is working on a similar project and what you considered before implementing it.
Thank you
When to call the web service
You should only call the web service when the user is interested in suggestions. The user will only type fast if he knows what to type. So while he's typing fast, you don't have to provide suggestions to the user.
When a fast typist pauses for a short time, then he's probably interested in search suggestions. That's when you call the web service to retrieve suggestions.
Slow typists will always benefit from search suggestions, because they can save time typing the query. In this case there will always be short pauses between keystrokes. Again, these short pauses are your cue to retrieve suggestions from the web service.
You can use the setTimeout function to call your web service 500 milliseconds after the user has pressed a key. If the user presses a key, you can reset the timeout using clearTimeout. This will result in a call to the web service only when the user is idle for half a second.
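That pattern is commonly called a debounce, and it can be sketched in plain JavaScript. The 500 ms delay and the `fetchSuggestions` name below are illustrative, not from the original:

```javascript
// Debounce: delay calling `fn` until `delayMs` ms have passed with no new call.
function debounce(fn, delayMs) {
  let timer = null;
  return function (...args) {
    clearTimeout(timer);                                // reset on every keystroke
    timer = setTimeout(() => fn.apply(this, args), delayMs);
  };
}

// Usage sketch: wire it to the search box's keyup event.
// `fetchSuggestions` would make the actual Ajax call to the web service.
// const onKeyUp = debounce(fetchSuggestions, 500);
// searchBox.addEventListener("keyup", e => onKeyUp(e.target.value));
```

This way only the last keystroke in a burst triggers a request, so holding down a key produces at most one call.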
Performance of LINQ-to-SQL
If your query isn't too complex, LINQ-to-SQL will probably perform just fine.
To improve performance, you can limit the number of suggestions to about twenty. Most users aren't interested in thousands of suggestions anyway.
Consider using a full-text catalog instead of the LIKE clause if you are searching through blocks of text to find specific keywords. Besides being much faster, it can be configured to recognize multiple forms of the same word (like mouse and mice, or leaf and leaves).
To really make your search shine, you can correct many common misspellings by using the Levenshtein distance to compare the search term to a list of similar terms when no matches are found.
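As a sketch of that idea, here is the standard dynamic-programming Levenshtein distance in JavaScript; when a search term returns no matches, you would compare it against your known terms and suggest the closest one:

```javascript
// Levenshtein distance: minimum number of single-character insertions,
// deletions, and substitutions needed to turn string `a` into string `b`.
function levenshtein(a, b) {
  // dp[i][j] = distance between the first i chars of a and first j chars of b
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,        // deletion
        dp[i][j - 1] + 1,        // insertion
        dp[i - 1][j - 1] + cost  // substitution (or match)
      );
    }
  }
  return dp[a.length][b.length];
}
```

With no matches for a misspelled term, you could suggest whichever known term has the smallest distance.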
Related
I've often heard this question from different sources, but never got a good idea of the technologies used to achieve it. Can anyone shed some light? The question: you have a website with a high volume of user visits per day. The website is deployed in a distributed manner, with multiple web servers and load balancers handling incoming requests from many locations. How do you detect the 1,000,000th visitor and show him a special page saying "congrats, you are our 1,000,000th visitor!", assuming a distributed backend?
You could do it with jQuery, for example:
$("#linkOfInterest").click(function () {
    // code for updating a variable/record that contains the current number of clicks
});
CSS:
a#linkOfInterest {
//style goes here
}
somewhere in the html :
<a id="linkOfInterest" href="somepage.htm"></a>
You are going to have to trade off performance or accuracy. The simplest way to do this would be to have a memcached instance keep track of your visitor counts, or some other datastore with an atomic increment operation. Since there is only a single source of truth, only one visitor will get the message. This will delay the loading of your page by at least the round trip to the store.
If you can't afford the delay, then you will have to trade off accuracy. A distributed data store will not be able to atomically increment the field any faster than a single instance. Every web server can read and write to a local node, but another node at another datacenter may also reach the one-million count before the transactions are reconciled. In that case two or more people may get the millionth-visitor message.
It is possible to do so after the fact. Eventually, the data store will reconcile the increments, and your application can decide on a strict ordering. However, if you have already decided that a single atomic request takes too long, then this logic will take place too late to render your page.
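The single-source-of-truth version boils down to one atomic check per request. This sketch simulates it in-process; in production the increment would be the datastore's atomic operation (e.g. memcached's incr), not a local variable:

```javascript
// Returns a function that increments a shared counter and reports whether
// this particular call was the target-th one. Exactly one caller can win.
function makeVisitorCounter(target) {
  let count = 0; // stand-in for the datastore's atomically incremented value
  return function recordVisit() {
    count += 1;              // in production: one atomic-increment round trip
    return count === target; // true for exactly one visitor
  };
}

// Usage sketch, per incoming request:
// if (recordVisit()) showCongratsPage();
```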
First of all I must say I am totally new to MT so forgive me if I am thinking in a totally wrong way.
I have to create a task where workers classify a sentence: is it spam, or does it fall into a certain category? I will have about 2500 sentences to classify per day.
What is the best way to use the API to do this? I understand how to create a HIT using the API, but my understanding is that I can't create a recurring HIT that changes itself once each sentence is classified. Do I need to create 2500 HITs?
I researched and found out about the External Question, which I can set up on my server and change with each form submit.
In that case, would it be just one HIT? Is that the correct way to do this?
I am confused in the dynamic part of MT.
Any tips, up-to-date documentation, or suggestions will be appreciated.
Thanks!
You likely want to create separate HITs.
If you create a single External HIT (hosted on your server), an MTurk Worker who takes your HIT will not be eligible to take another task (e.g. a classification task), since Workers are not allowed to take a single HIT more than once. However, if you create separate HITs, a Worker can take as many of them as they wish, which is probably what you want.
You are correct that you cannot automatically change a HIT dynamically unless it is run on your own server.
I have some results obtained through web service requests from a couple of different providers; I gather and order the results and show them to the user.
The number of the results is somewhere between 0 and 60-70, with an average of 10-20.
My problem is: how to handle pagination?
I'm trying to figure out which is the best solution for my situation, because I have found several ways to do it, and I am sure I am missing other good (probably better) solutions. The solutions I have thought of so far:
1) Making a new aggregated search through the web services for each page (15 results). This is stupid, but since the average number of results is 10-20, the pagination won't be used often.
2) Saving all the results in the database as a temporary cache, then showing 15 results at a time.
3) Loading all the results in a single page but showing only 15 per page using a jQuery pagination plugin (client side).
It depends on how big one result is, but with a maximum of 60-70 results I'd prefer no pagination at all, especially if that maximum is rarely reached. Better user experience.
Are you really sure that someday the web services aren't going to start returning a lot more results? What if someday there is a bug in one of them where it accidentally returns 50,000 copies of the same result to you? In each of your solutions:
A larger than expected number of results would cause you to spam the web services with repeated requests for the same results, as users page through them.
A larger than expected number of results will end up temporarily taking up space in your database. Also, in a web app, how will you know when to clear the cache?
A larger than expected number of results will end up as a huge page in the user's browser, possibly not rendering correctly until the whole thing is downloaded.
I really like option 3. The caching is done at the place where the data is wanted, there are no redundant hits to the web services, and paging will be super fast for the users.
If you're really certain no more than 60-70 results will ever be returned, and/or that your users will never want a really large number of results, you could combine option 3 with a cap on the number of results you will return.
Even in the worst case where the web services return erroneous/unexpected results, you could trim it to the first so many, send them down to the browser, and paginate them there with JavaScript.
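A minimal sketch of option 3 combined with that cap; the page size and cap values below are illustrative:

```javascript
// Cap the result set, then serve one page at a time entirely client-side.
const PAGE_SIZE = 15;   // results shown per page
const MAX_RESULTS = 70; // safety cap against erroneous/unexpected result floods

function capResults(results) {
  return results.slice(0, MAX_RESULTS); // trim before sending to the browser
}

function getPage(results, pageNumber) {
  // pageNumber is 1-based; returns the slice to render for that page.
  const start = (pageNumber - 1) * PAGE_SIZE;
  return results.slice(start, start + PAGE_SIZE);
}
```

A pagination plugin does essentially this internally; the point is that after the initial load, flipping pages never touches the server.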
I'm creating a Django website with Apache2 as the server. I need a way to determine the number of unique visitors to my website (specifically to every page) in a foolproof way. Unfortunately, users will have high incentives to try to "game" the tracking system, so I'm trying to make it foolproof.
Is there any way of doing this?
Currently I'm trying to use IP & Cookies to determine unique visitors, but this system can be easily fooled with a headless browser.
Unless it's necessary that the data be integrated into your Django database, I'd strongly recommend "outsourcing" your traffic tracking to another provider. I'm very happy with Google Analytics.
Failing that, there's really little you can do to keep someone from gaming the system. You could limit based on IP address but then of course you run into the problem that often many unique visitors share IPs (say, via a university, organization, or work site). Cookies are very easy to clear out, so if you go that route then it's very easy to game.
One thing that's harder to get rid of is files stored in the appcache, so one possible solution that would work on modern browsers is to store a file in the appcache. You'd count the first time it was loaded in as the unique visit, and after that since it's cached they don't get counted again.
Of course, since you presumably need this to be backwards compatible, it remains open to exactly the sorts of tools most likely to be used for gaming the system, such as curl.
You can certainly block non-browserlike user agents, which makes it slightly more difficult if some gamers don't know about spoofing browser agent strings (which most will quickly learn).
Really, the best solution might be to ask: what is the outcome of a visit to a page? If it is, for example, selling a product, then don't reward the people with the most page views; reward the people whose hits generate the most sales, or whatever time-consuming action someone might take on the page.
Possible solution:
If you're willing to ignore people with JavaScript disabled, you could count only people who access the page and then stay on it for a given window of time (say, 1 minute). After that period, make an Ajax request back to the server. If someone tried to game this by changing their cookie and loading multiple tabs at once, it wouldn't work, because they'd need the same cookie to register that they'd been on the page long enough. On the server side, store a dictionary called stay_until in request.session with a key for each unique page; after a minute or so, the page runs an Ajax call back to the server. If the value of stay_until[page_id] is less than or equal to the current time, they count as an active visitor; otherwise they don't. I actually think this might work; I can't honestly see a way to game it. It means it would take someone at least 20 minutes to generate 20 unique visits, and as long as you make the payoff worth less than the time consumed, that is a strong disincentive.
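The two server-side steps of that scheme can be sketched like this (shown in JavaScript purely for illustration, even though the answer assumes a Django session; the names stayUntil, recordPageLoad, and confirmVisit are all illustrative):

```javascript
// Dwell-time check: a visit only counts after the visitor has stayed a while.
const DWELL_MS = 60 * 1000; // required time on page before counting (1 minute)

function recordPageLoad(session, pageId, nowMs) {
  // On page load, record the earliest time this visit may be counted.
  session.stayUntil = session.stayUntil || {};
  session.stayUntil[pageId] = nowMs + DWELL_MS;
}

function confirmVisit(session, pageId, nowMs) {
  // Called by the later Ajax ping; true only after the dwell window elapsed.
  const until = session.stayUntil && session.stayUntil[pageId];
  return until !== undefined && nowMs >= until;
}
```

Opening twenty tabs at once doesn't help: all of them share one session, so the server still sees at most one confirmed visit per page per dwell window.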
I'd even make it more explicit: on the bottom of the page in a noscript tag, put "Your access was not counted. Turn on JavaScript to be counted" with a page that lays out the tracking process.
As HTTP requests are stateless and you have no control over the user's behavior on the client side, there is no bulletproof way.
The only way you're going to be able to track "unique" visitors in a fool-proof way is to make it contingent on some controlled factor such as a login. Anything else can and will fail to be completely accurate.
I have a Windows Phone 7 app that (currently) calls an OData service to get data, and throws the data into a listbox. It is horribly slow right now. The first thing I can think of is because OData returns way more data than I actually need.
What are some suggestions/best practices for speeding up the fetching of data in a Windows Phone 7 app? Is there anything I could be doing in the app to speed up the retrieval of data and put it in front of the user faster?
Sounds like you've already got some clues about what to chase.
Some basic things I'd try are:
Make your HTTP requests as small as possible - if possible, only fetch the entities and fields you absolutely need.
Consider using multiple HTTP requests to fetch the data incrementally instead of fetching everything in one go (this can, of course, actually make the app slower, but generally makes the app feel faster)
For large text transfers, make sure that the content is being zipped for transfer (this should happen at the HTTP level)
Be careful that the XAML rendering the data isn't too bloated - large XAML structure repeated in a list can cause slowness.
When optimising, never assume you know where the speed problem is - always measure first!
Be careful when inserting images into a list - the MS MarketPlace app often seems to stutter on my phone - and I think this is caused by the image fetch and render process.
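The incremental-fetch idea from the list above can be sketched like this (in JavaScript for brevity; on WP7 the same pattern applies with paged OData requests, and `fetchPage` is a stand-in for your actual data call):

```javascript
// Pull one small page at a time so the first results reach the UI quickly,
// instead of blocking until the full result set has downloaded.
async function fetchIncrementally(fetchPage, pageSize, onPage) {
  let page = 0;
  while (true) {
    const items = await fetchPage(page, pageSize);
    if (items.length === 0) break;    // no more data
    onPage(items);                    // render this batch immediately
    if (items.length < pageSize) break; // short page means we reached the end
    page += 1;
  }
}
```

Even though the total transfer time may be slightly longer, the user sees the first batch almost immediately, which is what makes the app feel faster.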
In addition to Stuart's great list, also consider the format of the data that's sent.
Check out this blog post by Rob Tiffany. It discusses performance based on data formats. It was written specifically with WCF in mind but the points still apply.
As an extension to Stuart's list:
In fact there are 3 areas - communication, parsing, UI. Measure them separately:
Do just the communication with the processing switched off.
Measure the parsing of a fixed OData-formatted string.
Believe it or not, it can also be the UI.
For example, bad usage of a ProgressBar can result in a dramatic decrease in processing speed. (In general you should not use such UI animations, as explained here.)
Also, make sure that the UI processing does not block the data communication.